Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-14 Thread Jungtaek Lim
+1 (non-binding), thanks Dongjoon.

On Sun, Apr 14, 2024 at 7:22 AM Dongjoon Hyun 
wrote:

> Please vote on SPARK-4 to use ANSI SQL mode by default.
> The technical scope is defined in the following PR, which consists of a
> one-line code change and a one-line migration guide entry.
>
> - DISCUSSION:
> https://lists.apache.org/thread/ztlwoz1v1sn81ssks12tb19x37zozxlz
> - JIRA: https://issues.apache.org/jira/browse/SPARK-4
> - PR: https://github.com/apache/spark/pull/46013
>
> The vote is open until April 17th 1AM (PST) and passes
> if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Use ANSI SQL mode by default
> [ ] -1 Do not use ANSI SQL mode by default because ...
>
> Thank you in advance.
>
> Dongjoon
>
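For readers who want to try the behavior ahead of the default flip: ANSI mode is already controllable via the long-standing `spark.sql.ansi.enabled` flag, and the PR under vote only changes its default. A sketch of opting in explicitly (so the setting survives either outcome of the vote):

```
# spark-defaults.conf
spark.sql.ansi.enabled  true
```

or per session, `SET spark.sql.ansi.enabled=true;` in Spark SQL. With ANSI mode on, operations such as invalid casts or integer overflow raise errors instead of silently returning NULL or wrapped values.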


Re: [DISCUSS] Spark 4.0.0 release

2024-04-14 Thread Jungtaek Lim
W.r.t. state data source - reader (SPARK-45511), there are several
follow-up tickets, but we don't plan to address them soon. The current
implementation is the final shape for Spark 4.0.0, unless there are demands
on the follow-up tickets.

We may want to check the plan for transformWithState - my understanding is
that we want to ship the feature in 4.0.0, but there is still remaining
work to be done. While the tentative timeline for the release is June 2024,
what would be the tentative timeline for the RC cut?
(cc. Anish to add more context on the plan for transformWithState)

On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan  wrote:

> Hi all,
>
> It's close to the previously proposed 4.0.0 release date (June 2024), and
> I think it's time to prepare for it and discuss the ongoing projects:
>
>- ANSI by default
>- Spark Connect GA
>- Structured Logging
>- Streaming state store data source
>- new data type VARIANT
>- STRING collation support
>- Spark k8s operator versioning
>
> Please help add any items that are missing from this list. I would
> like to volunteer as the release manager for Apache Spark 4.0.0 if there is
> no objection. Thank you all for the great work that fills Spark 4.0!
>
> Wenchen Fan
>


Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-04-11 Thread Jungtaek Lim
>>
>> Regarding the concerns about expertise in DRA,  I will find some core
>> contributors of this module/DRA and tag them to this email with details,
>> Mich has also highlighted the same in the past. Once we get approval from
>> them we can further discuss and enhance this to make the user experience
>> better.
>>
>> Thank you,
>>
>> Pavan
>>
>>
>> On Tue, Mar 26, 2024 at 8:12 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Sounds good.
>>>
>>> One thing I'd like to clarify before shepherding this SPIP is
>>> the process itself. Getting enough traction from PMC members is another
>>> hurdle to passing the SPIP vote. Even a vote from a committer is not
>>> counted. (I don't have a binding vote.) I only see one PMC member (Thomas
>>> Graves, not my team) in the design doc, and we still haven't gotten
>>> positive feedback. So there is still a long way to go. We need three
>>> supporters among PMC members.
>>>
>>> Another thing is, I get the proposal at a high level, but I don't have
>>> actual expertise in DRA. I could review the code in general, but I don't
>>> feel qualified to approve it. We still need an expert on the CORE area,
>>> especially one with DRA expertise. (Could you please annotate the code
>>> and enumerate several people who have worked on that codebase?) If they
>>> need streaming expertise to understand how things will work, either you
>>> or I can explain, but I can't just approve and merge the code.
>>>
>>> That said, if we succeed in finding one and they review the code and
>>> say LGTM, I'd rather not go through the SPIP process unless the expert
>>> reviewing your code asks us to. The change you proposed is rather small
>>> and does not seem invasive (experts can also weigh in), and this feature
>>> must never be turned on by default (given the limitation we pointed out).
>>> It does not look like it requires a SPIP, provided we carefully document
>>> the new change, clearly describe the limitation, and add a warning in the
>>> codebase that it must not be enabled by default.
>>>
>>>
>>> On Tue, Mar 26, 2024 at 7:02 PM Pavan Kotikalapudi <
>>> pkotikalap...@twilio.com> wrote:
>>>
>>>> Hi Bhuwan,
>>>>
>>>> Glad to hear back from you! Very much appreciate your help on reviewing
>>>> the design doc/PR and endorsing this proposal.
>>>>
>>>> Thank you so much @Jungtaek Lim  , @Mich
>>>> Talebzadeh   for graciously agreeing to
>>>> mentor/shepherd this effort.
>>>>
>>>> Regarding the Twilio copyright in the NOTICE-binary file:
>>>> Twilio's open-source counsel was involved throughout the process; I
>>>> placed it in the project file before Twilio signed a CCLA for the Spark
>>>> project contribution (Aug '23).
>>>>
>>>> Since the CCLA is signed now, I have removed the twilio copyright from
>>>> that file. I didn't get a chance to update the PR after github-actions
>>>> closed it.
>>>>
>>>> Please let me know of next steps needed to bring this draft PR/effort
>>>> to completion.
>>>>
>>>> Thank you,
>>>>
>>>> Pavan
>>>>
>>>>
>>>> On Tue, Mar 26, 2024 at 12:01 AM Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> I'm happy to, but it looks like I need to check one more thing about
>>>>> the license, according to the WIP PR
>>>>> <https://github.com/apache/spark/pull/42352>
>>>>> .
>>>>>
>>>>> @Pavan Kotikalapudi 
>>>>> I see you've added Twilio's copyright to the NOTICE-binary file,
>>>>> which makes me wonder whether Twilio has filed a CCLA with the Apache
>>>>> Software Foundation.
>>>>>
>>>>> PMC members can correct me if I'm mistaken, but from my understanding
>>>>> (and my experience as a PMC member in another ASF project), a code
>>>>> contribution is considered a code donation, and the copyright belongs
>>>>> to the ASF. That's why you can't find contributors' employers'
>>>>> copyrights in the codebase. The copyrights you see in NOTICE-binary

Re: Apache Spark 3.4.3 (?)

2024-04-07 Thread Jungtaek Lim
Sounds like a plan. +1 (non-binding) Thanks for volunteering!

On Sun, Apr 7, 2024 at 5:45 AM Dongjoon Hyun 
wrote:

> Hi, All.
>
> Apache Spark 3.4.2 tag was created on Nov 24th and `branch-3.4` has 85
> commits including important security and correctness patches like
> SPARK-45580, SPARK-46092, SPARK-46466, SPARK-46794, and SPARK-46862.
>
> https://github.com/apache/spark/releases/tag/v3.4.2
>
> $ git log --oneline v3.4.2..HEAD | wc -l
>   85
>
> SPARK-45580 Subquery changes the output schema of the outer query
> SPARK-46092 Overflow in Parquet row group filter creation causes incorrect
> results
> SPARK-46466 Vectorized parquet reader should never do rebase for timestamp
> ntz
> SPARK-46794 Incorrect results due to inferred predicate from checkpoint
> with subquery
> SPARK-46862 Incorrect count() of a dataframe loaded from CSV datasource
> SPARK-45445 Upgrade snappy to 1.1.10.5
> SPARK-47428 Upgrade Jetty to 9.4.54.v20240208
> SPARK-46239 Hide `Jetty` info
>
>
> Currently, I'm checking more applicable patches for branch-3.4. I'd like
> to propose to release Apache Spark 3.4.3 and volunteer as the release
> manager for Apache Spark 3.4.3. If there are no additional blockers, the
> first tentative RC1 vote date is April 15th (Monday).
>
> WDYT?
>
> Dongjoon.
>


Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-03-26 Thread Jungtaek Lim
Sounds good.

One thing I'd like to clarify before shepherding this SPIP is the process
itself. Getting enough traction from PMC members is another hurdle to
passing the SPIP vote. Even a vote from a committer is not counted. (I
don't have a binding vote.) I only see one PMC member (Thomas Graves, not
my team) in the design doc, and we still haven't gotten positive feedback.
So there is still a long way to go. We need three supporters among PMC
members.

Another thing is, I get the proposal at a high level, but I don't have
actual expertise in DRA. I could review the code in general, but I don't
feel qualified to approve it. We still need an expert on the CORE area,
especially one with DRA expertise. (Could you please annotate the code and
enumerate several people who have worked on that codebase?) If they need
streaming expertise to understand how things will work, either you or I
can explain, but I can't just approve and merge the code.

That said, if we succeed in finding one and they review the code and say
LGTM, I'd rather not go through the SPIP process unless the expert
reviewing your code asks us to. The change you proposed is rather small
and does not seem invasive (experts can also weigh in), and this feature
must never be turned on by default (given the limitation we pointed out).
It does not look like it requires a SPIP, provided we carefully document
the new change, clearly describe the limitation, and add a warning in the
codebase that it must not be enabled by default.
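For readers skimming the thread: the SPIP's core idea, as I understand it, is to scale executors for Structured Streaming based on how much of the trigger interval each micro-batch consumes, something the current task-backlog-based DRA handles poorly. A toy sketch of that heuristic - every name, ratio, and default below is invented for illustration and is not the SPARK-24815 code:

```python
# Toy sketch of trigger-interval-based executor scaling for Structured
# Streaming. Names, ratios, and defaults are invented for illustration.

def desired_executors(batch_duration_s: float,
                      trigger_interval_s: float,
                      current_executors: int,
                      scale_up_ratio: float = 0.9,
                      scale_down_ratio: float = 0.5,
                      min_executors: int = 1,
                      max_executors: int = 20) -> int:
    """Pick a new executor count from how much of the trigger interval
    the last micro-batch consumed."""
    utilization = batch_duration_s / trigger_interval_s
    if utilization > scale_up_ratio:
        # Batches nearly overrun the interval: add ~50% more capacity.
        target = current_executors + max(1, current_executors // 2)
    elif utilization < scale_down_ratio:
        # Batches finish well within the interval: shed idle capacity.
        target = max(min_executors,
                     int(current_executors * utilization / scale_down_ratio))
    else:
        # Comfortably inside the interval: hold steady.
        target = current_executors
    return max(min_executors, min(max_executors, target))
```

E.g., with a 10s trigger, a 9.5s batch scales up while a 2s batch scales down; a real implementation would likely drive this from streaming query progress events rather than a bare function like this.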


On Tue, Mar 26, 2024 at 7:02 PM Pavan Kotikalapudi 
wrote:

> Hi Bhuwan,
>
> Glad to hear back from you! Very much appreciate your help on reviewing
> the design doc/PR and endorsing this proposal.
>
> Thank you so much @Jungtaek Lim  , @Mich
> Talebzadeh   for graciously agreeing to
> mentor/shepherd this effort.
>
> Regarding the Twilio copyright in the NOTICE-binary file:
> Twilio's open-source counsel was involved throughout the process; I placed
> it in the project file before Twilio signed a CCLA for the Spark project
> contribution (Aug '23).
>
> Since the CCLA is signed now, I have removed the twilio copyright from
> that file. I didn't get a chance to update the PR after github-actions
> closed it.
>
> Please let me know of next steps needed to bring this draft PR/effort to
> completion.
>
> Thank you,
>
> Pavan
>
>
> On Tue, Mar 26, 2024 at 12:01 AM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> I'm happy to, but it looks like I need to check one more thing about the
>> license, according to the WIP PR
>> <https://github.com/apache/spark/pull/42352>
>> .
>>
>> @Pavan Kotikalapudi 
>> I see you've added Twilio's copyright to the NOTICE-binary file,
>> which makes me wonder whether Twilio has filed a CCLA with the Apache
>> Software Foundation.
>>
>> PMC members can correct me if I'm mistaken, but from my understanding
>> (and my experience as a PMC member in another ASF project), a code
>> contribution is considered a code donation, and the copyright belongs to
>> the ASF. That's why you can't find contributors' employers' copyrights in
>> the codebase. The copyrights you see in NOTICE-binary are there because we
>> have binary dependencies whose licenses may require an explicit copyright
>> notice. It's not about direct code contribution.
>>
>> Is Twilio aware of this? Also, if Twilio has not filed a CCLA before,
>> could you please engage the relevant group in the company (the legal team,
>> or an OSS advocacy team if there is one) and ensure that a CCLA is filed?
>> This is a legal matter, so we have to be conservative and 100% sure that
>> the employer understands what donating the code to the ASF means, having
>> reviewed the CCLA and relevant docs, and explicitly expresses that they
>> are OK with it by filing the CCLA.
>>
>> You can read the description of agreements on contribution and ICLA/CCLA
>> form from this page.
>> https://www.apache.org/licenses/contributor-agreements.html
>>
>> Please let me know once this is resolved; it seems to me to be a blocker
>> to moving on. Please also let me know if the employer withdraws the
>> contribution.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>>
>> On Mon, Mar 25, 2024 at 11:47 PM Bhuwan Sahni wrote:

Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-03-26 Thread Jungtaek Lim
I'm happy to, but it looks like I need to check one more thing about the
license, according to the WIP PR
<https://github.com/apache/spark/pull/42352>.

@Pavan Kotikalapudi 
I see you've added Twilio's copyright to the NOTICE-binary file, which
makes me wonder whether Twilio has filed a CCLA with the Apache Software
Foundation.

PMC members can correct me if I'm mistaken, but from my understanding (and
my experience as a PMC member in another ASF project), a code contribution
is considered a code donation, and the copyright belongs to the ASF. That's
why you can't find contributors' employers' copyrights in the codebase. The
copyrights you see in NOTICE-binary are there because we have binary
dependencies whose licenses may require an explicit copyright notice. It's
not about direct code contribution.

Is Twilio aware of this? Also, if Twilio has not filed a CCLA before, could
you please engage the relevant group in the company (the legal team, or an
OSS advocacy team if there is one) and ensure that a CCLA is filed? This is
a legal matter, so we have to be conservative and 100% sure that the
employer understands what donating the code to the ASF means, having
reviewed the CCLA and relevant docs, and explicitly expresses that they are
OK with it by filing the CCLA.

You can read the description of agreements on contribution and ICLA/CCLA
form from this page.
https://www.apache.org/licenses/contributor-agreements.html

Please let me know once this is resolved; it seems to me to be a blocker to
moving on. Please also let me know if the employer withdraws the
contribution.

Thanks,
Jungtaek Lim (HeartSaVioR)


On Mon, Mar 25, 2024 at 11:47 PM Bhuwan Sahni
 wrote:

> Hi Pavan,
>
> I looked at the PR, and the changes look simple and contained. It would be
> useful to add dynamic resource allocation to Spark Structured Streaming.
>
> Jungtaek. Would you be able to shepherd this change?
>
>
> On Tue, Mar 19, 2024 at 10:38 AM Bhuwan Sahni 
> wrote:
>
>> Thanks a lot for creating the risk table, Pavan. My apologies - I was
>> tied up with high-priority items for the last couple of weeks and could
>> not respond. I will review the PR by tomorrow's end and get back to you.
>>
>> Appreciate your patience.
>>
>> Thanks
>> Bhuwan Sahni
>>
>> On Sun, Mar 17, 2024 at 4:42 PM Pavan Kotikalapudi <
>> pkotikalap...@twilio.com> wrote:
>>
>>> Hi Bhuwan,
>>>
>>> I hope the team got a chance to review the draft PR; I'm looking for
>>> comments on whether the plan looks alright. I have updated the document
>>> about the risks
>>> <https://docs.google.com/document/d/1_YmfCsQQb9XhRdKh0ijbc-j8JKGtGBxYsk_30NVSTWo/edit#heading=h.577aawlyiedf>
>>> (also shown below). Please confirm whether it looks alright.
>>>
>>> | Spark application type | Current auto-scaling capability | With new auto-scaling capability |
>>> | --- | --- | --- |
>>> | Spark batch job | Works with current DRA | No change |
>>> | Streaming query without trigger interval | No implementation | Can work with this implementation (certain scale-back configs have to be set based on previous usage patterns) - maybe automate in future work? |
>>> | Streaming query with trigger interval | No implementation | Works with this implementation |
>>> | Streaming query with one-time micro-batch | Works with current DRA | No change |
>>> | Streaming query with AvailableNow micro-batch | Works with current DRA | No change |
>>> | Batch + streaming query (default/trigger-interval/once/availablenow modes), other notebook use cases | No implementation | No implementation |
>>>
>>>
>>>
>>> We are more than happy to collaborate on a call to make better progress
>>> on this enhancement. Please let us know.
>>>
>>> Thank you,
>>>
>>> Pavan
>>>
>>> On Fri, Mar 1, 2024 at 12:26 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>>
>>>> Hi Bhuwan et al,
>>>>
>>>> Thank you for passing on the DataBricks Structured Streaming team's
>>>> review of the SPIP document. FYI, I work closely with Pawan and oth

Re: [VOTE] SPIP: Structured Logging Framework for Apache Spark

2024-03-11 Thread Jungtaek Lim
+1 (non-binding), thanks Gengliang!

On Mon, Mar 11, 2024 at 5:46 PM Gengliang Wang  wrote:

> Hi all,
>
> I'd like to start the vote for SPIP: Structured Logging Framework for
> Apache Spark
>
> References:
>
>- JIRA ticket 
>- SPIP doc
>
> 
>- Discussion thread
>
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Thanks!
> Gengliang Wang
>
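For context on what "structured" means here: the SPIP moves Spark's logs from free-form text toward machine-readable records carrying context fields, so logs can be loaded back into Spark and queried. A minimal sketch of the idea using Python's standard logging - the field names and schema below are invented for illustration and are not the SPIP's actual format:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log record so logs can be parsed back
    into rows and queried. Field names are illustrative only."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
            # Context travels as structured data, not interpolated text.
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry)

logger = logging.getLogger("demo")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
# `extra` attaches the dict to the record as `record.context`.
logger.warning("task failed", extra={"context": {"executor_id": "7"}})
```

The payoff is that a field like `executor_id` becomes a queryable column instead of a substring to grep for.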


Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Jungtaek Lim
Yeah, the approach seems OK to me - please double-check that the doc
generation in the Spark repo won't fail after the move of the js file.
Other than that, it would probably just be a matter of updating the release
process.

On Tue, Mar 5, 2024 at 7:24 PM Pan,Bingkun  wrote:

> Okay, I see.
>
> Perhaps we can solve this confusion by sharing the same file `version.json`
> across `all versions` in the `Spark website repo`? Make each version of
> the document display the `same` data in the dropdown menu.
> --
> *From:* Jungtaek Lim
> *Sent:* March 5, 2024 17:09:07
> *To:* Pan,Bingkun
> *Cc:* Dongjoon Hyun; dev; user
> *Subject:* Re: [ANNOUNCE] Apache Spark 3.5.1 released
>
> Let me be more specific.
>
> We have two active release version lines, 3.4.x and 3.5.x. We just
> released Spark 3.5.1, whose dropdown shows 3.5.1 and 3.4.2, given that the
> latest version of 3.4.x is 3.4.2. Say we release Spark 3.4.3 a month
> later. The dropdown of Spark 3.4.3 will show 3.5.1 and 3.4.3. But if we
> stop there, 3.5.1 (still the latest) won't show 3.4.3 in its dropdown,
> giving the impression that 3.4.3 was never released.
>
> And this is just two active release lines, keeping only the latest
> version of each. If you expand this to EOLed lines and versions that
> aren't the latest in their line, the problem gets much more complicated.
>
> On Tue, Mar 5, 2024 at 6:01 PM Pan,Bingkun  wrote:
>
>> Based on my understanding, we should not update versions that have
>> already been released,
>>
>> such as the situation you mentioned: `But what about the dropdown of
>> version D? Should we add E in the dropdown?` We only need to record the
>> latest `version.json` file that has already been published at the time of
>> each new document release.
>>
>> Of course, if we need to keep the latest in every document, I think it's
>> also possible.
>>
>> Only by sharing the same version.json file in each version.
>> --
>> *From:* Jungtaek Lim
>> *Sent:* March 5, 2024 16:47:30
>> *To:* Pan,Bingkun
>> *Cc:* Dongjoon Hyun; dev; user
>> *Subject:* Re: [ANNOUNCE] Apache Spark 3.5.1 released
>>
>> But this does not answer my question about updating the dropdown for the
>> doc of "already released versions", right?
>>
>> Let's say we just released version D, and the dropdown has version A, B,
>> C. We have another release tomorrow as version E, and it's probably easy to
>> add A, B, C, D in the dropdown of E. But what about dropdown of version D?
>> Should we add E in the dropdown? How do we maintain it if we will have 10
>> releases afterwards?
>>
>> On Tue, Mar 5, 2024 at 5:27 PM Pan,Bingkun  wrote:
>>
>>> According to my understanding, the original intention of this feature
>>> is that when a user lands in the PySpark documentation and finds it is
>>> not the version they want, they can easily jump to the right version via
>>> the drop-down box. Additionally, the PR that would have automated this
>>> mechanism was never merged:
>>>
>>> https://github.com/apache/spark/pull/42881
>>>
>>> So, we need to manually update this file. I can manually submit an
>>> update first to get this feature working.
>>> --
>>> *From:* Jungtaek Lim
>>> *Sent:* March 4, 2024 6:34:42
>>> *To:* Dongjoon Hyun
>>> *Cc:* dev; user
>>> *Subject:* Re: [ANNOUNCE] Apache Spark 3.5.1 released
>>>
>>> Shall we revisit this functionality? The API doc is built per version,
>>> and each version's doc depends on other released versions. This does not
>>> seem right to me. Also, the functionality exists only in the PySpark API
>>> doc, which is not consistent either.
>>>
>>> I don't think this is manageable with the current approach (listing
>>> versions in version-dependent docs). Let's say we release 3.4.3 after
>>> 3.5.1. Should we update the versions in 3.5.1 to add 3.4.3 to the version
>>> switcher? What about when we release a new version after ten more
>>> releases? What are the criteria for pruning versions?
>>>
>>> Unless we have good answers to these questions, I think it's better to
>>> revert the functionality - it missed various considerations.
>>> On Fri, Mar 1,

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Jungtaek Lim
Let me be more specific.

We have two active release version lines, 3.4.x and 3.5.x. We just released
Spark 3.5.1, whose dropdown shows 3.5.1 and 3.4.2, given that the latest
version of 3.4.x is 3.4.2. Say we release Spark 3.4.3 a month later. The
dropdown of Spark 3.4.3 will show 3.5.1 and 3.4.3. But if we stop there,
3.5.1 (still the latest) won't show 3.4.3 in its dropdown, giving the
impression that 3.4.3 was never released.

And this is just two active release lines, keeping only the latest version
of each. If you expand this to EOLed lines and versions that aren't the
latest in their line, the problem gets much more complicated.
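One way to read the suggestion of a single shared version.json: each release appends itself to one file that every doc version's dropdown fetches, so already-published docs stay current without rebuilding them. A sketch of what the release-time update could look like - the schema and helper below are invented for illustration (loosely following pydata-sphinx-theme's version-switcher format), not Spark's actual tooling:

```python
import json

def add_release(versions_json: str, new_version: str) -> str:
    """Prepend a release entry to a shared version.json (newest first)
    unless it is already present. Schema is illustrative only."""
    entries = json.loads(versions_json)
    if all(e["version"] != new_version for e in entries):
        entries.insert(0, {
            "name": new_version,
            "version": new_version,
            "url": f"https://spark.apache.org/docs/{new_version}/api/python/",
        })
    return json.dumps(entries, indent=2)

# The shared file starts with the current latest release...
shared = '[{"name": "3.5.1", "version": "3.5.1", "url": "https://spark.apache.org/docs/3.5.1/api/python/"}]'
shared = add_release(shared, "3.4.3")  # ...and the new release joins it.
shared = add_release(shared, "3.4.3")  # Re-running is a no-op.
```

Because every doc version reads the same file, the 3.5.1 docs would show 3.4.3 the moment this file is updated, without touching the 3.5.1 build.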

On Tue, Mar 5, 2024 at 6:01 PM Pan,Bingkun  wrote:

> Based on my understanding, we should not update versions that have already
> been released,
>
> such as the situation you mentioned: `But what about the dropdown of
> version D? Should we add E in the dropdown?` We only need to record the
> latest `version.json` file that has already been published at the time of
> each new document release.
>
> Of course, if we need to keep the latest in every document, I think it's
> also possible.
>
> Only by sharing the same version.json file in each version.
> ------
> *From:* Jungtaek Lim
> *Sent:* March 5, 2024 16:47:30
> *To:* Pan,Bingkun
> *Cc:* Dongjoon Hyun; dev; user
> *Subject:* Re: [ANNOUNCE] Apache Spark 3.5.1 released
>
> But this does not answer my question about updating the dropdown for the
> doc of "already released versions", right?
>
> Let's say we just released version D, and the dropdown has version A, B,
> C. We have another release tomorrow as version E, and it's probably easy to
> add A, B, C, D in the dropdown of E. But what about dropdown of version D?
> Should we add E in the dropdown? How do we maintain it if we will have 10
> releases afterwards?
>
> On Tue, Mar 5, 2024 at 5:27 PM Pan,Bingkun  wrote:
>
>> According to my understanding, the original intention of this feature is
>> that when a user lands in the PySpark documentation and finds it is not
>> the version they want, they can easily jump to the right version via the
>> drop-down box. Additionally, the PR that would have automated this
>> mechanism was never merged:
>>
>> https://github.com/apache/spark/pull/42881
>>
>> So, we need to manually update this file. I can manually submit an update
>> first to get this feature working.
>> --
>> *From:* Jungtaek Lim
>> *Sent:* March 4, 2024 6:34:42
>> *To:* Dongjoon Hyun
>> *Cc:* dev; user
>> *Subject:* Re: [ANNOUNCE] Apache Spark 3.5.1 released
>>
>> Shall we revisit this functionality? The API doc is built per version,
>> and each version's doc depends on other released versions. This does not
>> seem right to me. Also, the functionality exists only in the PySpark API
>> doc, which is not consistent either.
>>
>> I don't think this is manageable with the current approach (listing
>> versions in version-dependent docs). Let's say we release 3.4.3 after
>> 3.5.1. Should we update the versions in 3.5.1 to add 3.4.3 to the version
>> switcher? What about when we release a new version after ten more
>> releases? What are the criteria for pruning versions?
>>
>> Unless we have good answers to these questions, I think it's better to
>> revert the functionality - it missed various considerations.
>>
>> On Fri, Mar 1, 2024 at 2:44 PM Jungtaek Lim 
>> wrote:
>>
>>> Thanks for reporting - this is odd - the dropdown did not exist in other
>>> recent releases.
>>>
>>> https://spark.apache.org/docs/3.5.0/api/python/index.html
>>> https://spark.apache.org/docs/3.4.2/api/python/index.html
>>> https://spark.apache.org/docs/3.3.4/api/python/index.html
>>>
>>> Looks like the dropdown feature was recently introduced but only
>>> partially done. The dropdown itself was added, but how to bump the
>>> version was never documented.

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Jungtaek Lim
But this does not answer my question about updating the dropdown for the
doc of "already released versions", right?

Let's say we just released version D, and the dropdown has version A, B, C.
We have another release tomorrow as version E, and it's probably easy to
add A, B, C, D in the dropdown of E. But what about dropdown of version D?
Should we add E in the dropdown? How do we maintain it if we will have 10
releases afterwards?

On Tue, Mar 5, 2024 at 5:27 PM Pan,Bingkun  wrote:

> According to my understanding, the original intention of this feature is
> that when a user lands in the PySpark documentation and finds it is not
> the version they want, they can easily jump to the right version via the
> drop-down box. Additionally, the PR that would have automated this
> mechanism was never merged:
>
> https://github.com/apache/spark/pull/42881
>
> So, we need to manually update this file. I can manually submit an update
> first to get this feature working.
> ------
> *From:* Jungtaek Lim
> *Sent:* March 4, 2024 6:34:42
> *To:* Dongjoon Hyun
> *Cc:* dev; user
> *Subject:* Re: [ANNOUNCE] Apache Spark 3.5.1 released
>
> Shall we revisit this functionality? The API doc is built per version,
> and each version's doc depends on other released versions. This does not
> seem right to me. Also, the functionality exists only in the PySpark API
> doc, which is not consistent either.
>
> I don't think this is manageable with the current approach (listing
> versions in version-dependent docs). Let's say we release 3.4.3 after
> 3.5.1. Should we update the versions in 3.5.1 to add 3.4.3 to the version
> switcher? What about when we release a new version after ten more
> releases? What are the criteria for pruning versions?
>
> Unless we have good answers to these questions, I think it's better to
> revert the functionality - it missed various considerations.
>
> On Fri, Mar 1, 2024 at 2:44 PM Jungtaek Lim 
> wrote:
>
>> Thanks for reporting - this is odd - the dropdown did not exist in other
>> recent releases.
>>
>> https://spark.apache.org/docs/3.5.0/api/python/index.html
>> https://spark.apache.org/docs/3.4.2/api/python/index.html
>> https://spark.apache.org/docs/3.3.4/api/python/index.html
>>
>> Looks like the dropdown feature was recently introduced but only
>> partially done. The dropdown itself was added, but how to bump the
>> version was never documented.
>> The contributor proposed a way to update the version "automatically",
>> but the PR wasn't merged. As a result, we have neither instructions for
>> bumping the version manually nor an automatic bump.
>>
>> * PR for addition of dropdown: https://github.com/apache/spark/pull/42428
>> * PR for automatically bumping version:
>> https://github.com/apache/spark/pull/42881
>>
>> We will probably need to add an instruction in the release process to
>> update the version. (For automatic bumping I don't have a good idea.)
>> I'll look into it. Please expect some delay during the holiday weekend
>> in S. Korea.
>>
>> Thanks again.
>> Jungtaek Lim (HeartSaVioR)
>>
>>
>> On Fri, Mar 1, 2024 at 2:14 PM Dongjoon Hyun 
>> wrote:
>>
>>> BTW, Jungtaek.
>>>
>>> PySpark document seems to show a wrong branch. At this time, `master`.
>>>
>>> https://spark.apache.org/docs/3.5.1/api/python/index.html
>>>
>>> PySpark Overview
>>> <https://spark.apache.org/docs/3.5.1/api/python/index.html#pyspark-overview>
>>>
>>>Date: Feb 24, 2024 Version: master
>>>
>>> [image: Screenshot 2024-02-29 at

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread Jungtaek Lim
Yes, it's related to that PR. I wonder whether, if we want to expose a
version switcher, it should live in the versionless docs (spark-website)
rather than in docs pinned to a specific version.

On Tue, Mar 5, 2024 at 11:18 AM Hyukjin Kwon  wrote:

> Is this related to https://github.com/apache/spark/pull/42428?
>
> cc @Yang,Jie(INF) 
>
> On Mon, 4 Mar 2024 at 22:21, Jungtaek Lim 
> wrote:
>
>> Shall we revisit this functionality? The API doc is built per version,
>> and each version's doc depends on other released versions. This does not
>> seem right to me. Also, the functionality exists only in the PySpark API
>> doc, which is not consistent either.
>>
>> I don't think this is manageable with the current approach (listing
>> versions in version-dependent docs). Let's say we release 3.4.3 after
>> 3.5.1. Should we update the versions in 3.5.1 to add 3.4.3 to the version
>> switcher? What about when we release a new version after ten more
>> releases? What are the criteria for pruning versions?
>>
>> Unless we have good answers to these questions, I think it's better to
>> revert the functionality - it missed various considerations.
>>
>> On Fri, Mar 1, 2024 at 2:44 PM Jungtaek Lim 
>> wrote:
>>
>>> Thanks for reporting - this is odd - the dropdown did not exist in other
>>> recent releases.
>>>
>>> https://spark.apache.org/docs/3.5.0/api/python/index.html
>>> https://spark.apache.org/docs/3.4.2/api/python/index.html
>>> https://spark.apache.org/docs/3.3.4/api/python/index.html
>>>
>>> Looks like the dropdown feature was recently introduced but partially
>>> done. The addition of a dropdown was done, but the way how to bump the
>>> version was missed to be documented.
>>> The contributor proposed the way to update the version "automatically",
>>> but the PR wasn't merged. As a result, we are neither having the
>>> instruction how to bump the version manually, nor having the automatic bump.
>>>
>>> * PR for addition of dropdown:
>>> https://github.com/apache/spark/pull/42428
>>> * PR for automatically bumping version:
>>> https://github.com/apache/spark/pull/42881
>>>
>>> We will probably need to add an instruction in the release process to
>>> update the version. (For automatic bumping I don't have a good idea.)
>>> I'll look into it. Please expect some delay during the holiday weekend
>>> in S. Korea.
>>>
>>> Thanks again.
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>>
>>> On Fri, Mar 1, 2024 at 2:14 PM Dongjoon Hyun 
>>> wrote:
>>>
>>>> BTW, Jungtaek.
>>>>
>>>> PySpark document seems to show a wrong branch. At this time, `master`.
>>>>
>>>> https://spark.apache.org/docs/3.5.1/api/python/index.html
>>>>
>>>> PySpark Overview
>>>> <https://spark.apache.org/docs/3.5.1/api/python/index.html#pyspark-overview>
>>>>
>>>>Date: Feb 24, 2024 Version: master
>>>>
>>>> [image: Screenshot 2024-02-29 at 21.12.24.png]
>>>>
>>>>
>>>> Could you do the follow-up, please?
>>>>
>>>> Thank you in advance.
>>>>
>>>> Dongjoon.
>>>>
>>>>
>>>> On Thu, Feb 29, 2024 at 2:48 PM John Zhuge  wrote:
>>>>
>>>>> Excellent work, congratulations!
>>>>>
>>>>> On Wed, Feb 28, 2024 at 10:12 PM Dongjoon Hyun <
>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>
>>>>>> Congratulations!
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>> On Wed, Feb 28, 2024 at 11:43 AM beliefer  wrote:
>>>>>>
>>>>>>> Congratulations!
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> At 2024-02-28 17:43:25, "Jungtaek Lim" 
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> We are happy to announce the availability of Spark 3.5.1!
>>>>>>>
>>>>>>> Spark 3.5.1 is a maintenance release containing stability fixes. This
>>>>>>> release is based on the branch-3.5 maintenance branch of Spark. We
>>>>>>> strongly
>>>>>>> recommend all 3.5 users to upgrade to this stable release.
>>>>>>>
>>>>>>> To download Spark 3.5.1, head over to the download page:
>>>>>>> https://spark.apache.org/downloads.html
>>>>>>>
>>>>>>> To view the release notes:
>>>>>>> https://spark.apache.org/releases/spark-release-3-5-1.html
>>>>>>>
>>>>>>> We would like to acknowledge all community members for contributing
>>>>>>> to this
>>>>>>> release. This release would not have been possible without you.
>>>>>>>
>>>>>>> Jungtaek Lim
>>>>>>>
>>>>>>> ps. Yikun is helping us through releasing the official docker image
>>>>>>> for Spark 3.5.1 (Thanks Yikun!) It may take some time to be generally
>>>>>>> available.
>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>> John Zhuge
>>>>>
>>>>


Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-03 Thread Jungtaek Lim
Shall we revisit this functionality? The API doc is built per version, and
each version's doc depends on other released versions. This does not seem
right to me. Also, the functionality exists only in the PySpark API doc,
which is inconsistent as well.

I don't think this is manageable with the current approach (listing
versions in a version-pinned doc). Say we release 3.4.3 after 3.5.1 -
should we then update the 3.5.1 doc to add 3.4.3 to the version switcher?
And what happens once we have released ten more versions? What are the
criteria for pruning the list?

Unless we have good answers to these questions, I think it's better to
revert the functionality - it missed various considerations.

On Fri, Mar 1, 2024 at 2:44 PM Jungtaek Lim 
wrote:

> Thanks for reporting - this is odd - the dropdown did not exist in other
> recent releases.
>
> https://spark.apache.org/docs/3.5.0/api/python/index.html
> https://spark.apache.org/docs/3.4.2/api/python/index.html
> https://spark.apache.org/docs/3.3.4/api/python/index.html
>
> Looks like the dropdown feature was recently introduced but partially
> done. The addition of a dropdown was done, but the way how to bump the
> version was missed to be documented.
> The contributor proposed the way to update the version "automatically",
> but the PR wasn't merged. As a result, we are neither having the
> instruction how to bump the version manually, nor having the automatic bump.
>
> * PR for addition of dropdown: https://github.com/apache/spark/pull/42428
> * PR for automatically bumping version:
> https://github.com/apache/spark/pull/42881
>
> We will probably need to add an instruction in the release process to
> update the version. (For automatic bumping I don't have a good idea.)
> I'll look into it. Please expect some delay during the holiday weekend
> in S. Korea.
>
> Thanks again.
> Jungtaek Lim (HeartSaVioR)
>
>
> On Fri, Mar 1, 2024 at 2:14 PM Dongjoon Hyun 
> wrote:
>
>> BTW, Jungtaek.
>>
>> PySpark document seems to show a wrong branch. At this time, `master`.
>>
>> https://spark.apache.org/docs/3.5.1/api/python/index.html
>>
>> PySpark Overview
>> <https://spark.apache.org/docs/3.5.1/api/python/index.html#pyspark-overview>
>>
>>Date: Feb 24, 2024 Version: master
>>
>> [image: Screenshot 2024-02-29 at 21.12.24.png]
>>
>>
>> Could you do the follow-up, please?
>>
>> Thank you in advance.
>>
>> Dongjoon.
>>
>>
>> On Thu, Feb 29, 2024 at 2:48 PM John Zhuge  wrote:
>>
>>> Excellent work, congratulations!
>>>
>>> On Wed, Feb 28, 2024 at 10:12 PM Dongjoon Hyun 
>>> wrote:
>>>
>>>> Congratulations!
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>> On Wed, Feb 28, 2024 at 11:43 AM beliefer  wrote:
>>>>
>>>>> Congratulations!
>>>>>
>>>>>
>>>>>
>>>>> At 2024-02-28 17:43:25, "Jungtaek Lim" 
>>>>> wrote:
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> We are happy to announce the availability of Spark 3.5.1!
>>>>>
>>>>> Spark 3.5.1 is a maintenance release containing stability fixes. This
>>>>> release is based on the branch-3.5 maintenance branch of Spark. We
>>>>> strongly
>>>>> recommend all 3.5 users to upgrade to this stable release.
>>>>>
>>>>> To download Spark 3.5.1, head over to the download page:
>>>>> https://spark.apache.org/downloads.html
>>>>>
>>>>> To view the release notes:
>>>>> https://spark.apache.org/releases/spark-release-3-5-1.html
>>>>>
>>>>> We would like to acknowledge all community members for contributing to
>>>>> this
>>>>> release. This release would not have been possible without you.
>>>>>
>>>>> Jungtaek Lim
>>>>>
>>>>> ps. Yikun is helping us through releasing the official docker image
>>>>> for Spark 3.5.1 (Thanks Yikun!) It may take some time to be generally
>>>>> available.
>>>>>
>>>>>
>>>
>>> --
>>> John Zhuge
>>>
>>


Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Jungtaek Lim
Thanks for reporting - this is odd - the dropdown did not exist in other
recent releases.

https://spark.apache.org/docs/3.5.0/api/python/index.html
https://spark.apache.org/docs/3.4.2/api/python/index.html
https://spark.apache.org/docs/3.3.4/api/python/index.html

Looks like the dropdown feature was recently introduced but only partially:
the dropdown itself was added, but the way to bump the version was never
documented.
The contributor proposed a way to update the version "automatically", but
that PR wasn't merged. As a result, we have neither instructions for
bumping the version manually nor an automatic bump.

* PR for addition of dropdown: https://github.com/apache/spark/pull/42428
* PR for automatically bumping version:
https://github.com/apache/spark/pull/42881

We will probably need to add an instruction to the release process for
updating the version. (For automatic bumping I don't have a good idea yet.)
I'll look into it. Please expect some delay during the holiday weekend
in S. Korea.

Thanks again.
Jungtaek Lim (HeartSaVioR)


On Fri, Mar 1, 2024 at 2:14 PM Dongjoon Hyun 
wrote:

> BTW, Jungtaek.
>
> PySpark document seems to show a wrong branch. At this time, `master`.
>
> https://spark.apache.org/docs/3.5.1/api/python/index.html
>
> PySpark Overview
> <https://spark.apache.org/docs/3.5.1/api/python/index.html#pyspark-overview>
>
>Date: Feb 24, 2024 Version: master
>
> [image: Screenshot 2024-02-29 at 21.12.24.png]
>
>
> Could you do the follow-up, please?
>
> Thank you in advance.
>
> Dongjoon.
>
>
> On Thu, Feb 29, 2024 at 2:48 PM John Zhuge  wrote:
>
>> Excellent work, congratulations!
>>
>> On Wed, Feb 28, 2024 at 10:12 PM Dongjoon Hyun 
>> wrote:
>>
>>> Congratulations!
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Wed, Feb 28, 2024 at 11:43 AM beliefer  wrote:
>>>
>>>> Congratulations!
>>>>
>>>>
>>>>
>>>> At 2024-02-28 17:43:25, "Jungtaek Lim" 
>>>> wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> We are happy to announce the availability of Spark 3.5.1!
>>>>
>>>> Spark 3.5.1 is a maintenance release containing stability fixes. This
>>>> release is based on the branch-3.5 maintenance branch of Spark. We
>>>> strongly
>>>> recommend all 3.5 users to upgrade to this stable release.
>>>>
>>>> To download Spark 3.5.1, head over to the download page:
>>>> https://spark.apache.org/downloads.html
>>>>
>>>> To view the release notes:
>>>> https://spark.apache.org/releases/spark-release-3-5-1.html
>>>>
>>>> We would like to acknowledge all community members for contributing to
>>>> this
>>>> release. This release would not have been possible without you.
>>>>
>>>> Jungtaek Lim
>>>>
>>>> ps. Yikun is helping us through releasing the official docker image for
>>>> Spark 3.5.1 (Thanks Yikun!) It may take some time to be generally 
>>>> available.
>>>>
>>>>
>>
>> --
>> John Zhuge
>>
>


[ANNOUNCE] Apache Spark 3.5.1 released

2024-02-28 Thread Jungtaek Lim
Hi everyone,

We are happy to announce the availability of Spark 3.5.1!

Spark 3.5.1 is a maintenance release containing stability fixes. This
release is based on the branch-3.5 maintenance branch of Spark. We strongly
recommend that all 3.5 users upgrade to this stable release.

To download Spark 3.5.1, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-5-1.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.

Jungtaek Lim

P.S. Yikun is helping us release the official Docker image for Spark 3.5.1
(thanks, Yikun!). It may take some time to become generally available.


Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-23 Thread Jungtaek Lim
Thanks for figuring this out - that's my bad. My understanding is that the
3.5.1 RC2 docs were generated correctly for the VOTE, but the issue
happened during the finalization step.

I lost the build artifact for the docs (I followed the steps and removed
the docs from the dev dist before realizing I shouldn't have), and I
accidentally rebuilt the docs from the branch I had used for debugging an
issue in the RC.

I'll rebuild the docs from the tag and submit a PR again.

On Sat, Feb 24, 2024 at 7:16 AM Dongjoon Hyun 
wrote:

> Hi, All.
>
> Unfortunately, the Apache Spark `3.5.1 RC2` document artifact seems to be
> generated from unknown source code instead of the correct source code of
> the tag, `3.5.1`.
>
> https://spark.apache.org/docs/3.5.1/
>
> [image: Screenshot 2024-02-23 at 14.13.07.png]
>
> Dongjoon.
>
>
>
> On Wed, Feb 21, 2024 at 7:15 AM Jungtaek Lim 
> wrote:
>
>> Thanks everyone for participating the vote! The vote passed.
>> I'll send out the vote result and proceed to the next steps.
>>
>> On Wed, Feb 21, 2024 at 4:36 PM Maxim Gekk 
>> wrote:
>>
>>> +1
>>>
>>> On Wed, Feb 21, 2024 at 9:50 AM Hyukjin Kwon 
>>> wrote:
>>>
>>>> +1
>>>>
>>>> On Tue, 20 Feb 2024 at 22:00, Cheng Pan  wrote:
>>>>
>>>>> +1 (non-binding)
>>>>>
>>>>> - Build successfully from source code.
>>>>> - Pass integration tests with Spark ClickHouse Connector[1]
>>>>>
>>>>> [1] https://github.com/housepower/spark-clickhouse-connector/pull/299
>>>>>
>>>>> Thanks,
>>>>> Cheng Pan
>>>>>
>>>>>
>>>>> > On Feb 20, 2024, at 10:56, Jungtaek Lim <
>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>> >
>>>>> > Thanks Sean, let's continue the process for this RC.
>>>>> >
>>>>> > +1 (non-binding)
>>>>> >
>>>>> > - downloaded all files from URL
>>>>> > - checked signature
>>>>> > - extracted all archives
>>>>> > - ran all tests from source files in source archive file, via
>>>>> running "sbt clean test package" - Ubuntu 20.04.4 LTS, OpenJDK 17.0.9.
>>>>> >
>>>>> > Also bump to dev@ to encourage participation - looks like the
>>>>> timing is not good for US folks but let's see more days.
>>>>> >
>>>>> >
>>>>> > On Sat, Feb 17, 2024 at 1:49 AM Sean Owen  wrote:
>>>>> > Yeah let's get that fix in, but it seems to be a minor test only
>>>>> issue so should not block release.
>>>>> >
>>>>> > On Fri, Feb 16, 2024, 9:30 AM yangjie01  wrote:
>>>>> > Very sorry. When I was fixing `SPARK-45242 (
>>>>> https://github.com/apache/spark/pull/43594)`
>>>>> <https://github.com/apache/spark/pull/43594)>, I noticed that its
>>>>> `Affects Version` and `Fix Version` of SPARK-45242 were both 4.0, and I
>>>>> didn't realize that it had also been merged into branch-3.5, so I didn't
>>>>> advocate for SPARK-45357 to be backported to branch-3.5.
>>>>> >  As far as I know, the condition to trigger this test failure is:
>>>>> when using Maven to test the `connect` module, if  `sparkTestRelation` in
>>>>> `SparkConnectProtoSuite` is not the first `DataFrame` to be initialized,
>>>>> then the `id` of `sparkTestRelation` will no longer be 0. So, I think this
>>>>> is indeed related to the order in which Maven executes the test cases in
>>>>> the `connect` module.
>>>>> >  I have submitted a backport PR to branch-3.5, and if necessary, we
>>>>> can merge it to fix this test issue.
>>>>> >  Jie Yang
>>>>> >   From: Jungtaek Lim 
>>>>> > Date: Friday, February 16, 2024, 22:15
>>>>> > To: Sean Owen , Rui Wang 
>>>>> > Cc: dev 
>>>>> > Subject: Re: [VOTE] Release Apache Spark 3.5.1 (RC2)
>>>>> >   I traced back relevant changes and got a sense of what happened.
>>>>> >   Yangjie figured out the issue via link. It's a tricky issue
>>>>> according to the comments from Yangjie - the test is dependent on ordering
>>>>> of execution for test suites. He said it does not fail in sbt, hence CI
>>>>> build couldn't catch it.
>>>>> > He fixed it via link, but we missed that the offending commit was also ported back to 3.5, hence the fix wasn't ported back to 3.5.

[VOTE][RESULT] Release Apache Spark 3.5.1 (RC2)

2024-02-21 Thread Jungtaek Lim
The vote passes with 6 +1s (4 binding +1s).
Thanks to all who helped with the release!

(* = binding)
+1:
Jungtaek Lim
Wenchen Fan (*)
Cheng Pan
Xiao Li (*)
Hyukjin Kwon (*)
Maxim Gekk (*)

+0: None

-1: None


Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-21 Thread Jungtaek Lim
Thanks, everyone, for participating in the vote! The vote passed.
I'll send out the vote result and proceed to the next steps.

On Wed, Feb 21, 2024 at 4:36 PM Maxim Gekk 
wrote:

> +1
>
> On Wed, Feb 21, 2024 at 9:50 AM Hyukjin Kwon  wrote:
>
>> +1
>>
>> On Tue, 20 Feb 2024 at 22:00, Cheng Pan  wrote:
>>
>>> +1 (non-binding)
>>>
>>> - Build successfully from source code.
>>> - Pass integration tests with Spark ClickHouse Connector[1]
>>>
>>> [1] https://github.com/housepower/spark-clickhouse-connector/pull/299
>>>
>>> Thanks,
>>> Cheng Pan
>>>
>>>
>>> > On Feb 20, 2024, at 10:56, Jungtaek Lim 
>>> wrote:
>>> >
>>> > Thanks Sean, let's continue the process for this RC.
>>> >
>>> > +1 (non-binding)
>>> >
>>> > - downloaded all files from URL
>>> > - checked signature
>>> > - extracted all archives
>>> > - ran all tests from source files in source archive file, via running
>>> "sbt clean test package" - Ubuntu 20.04.4 LTS, OpenJDK 17.0.9.
>>> >
>>> > Also bump to dev@ to encourage participation - looks like the timing
>>> is not good for US folks but let's see more days.
>>> >
>>> >
>>> > On Sat, Feb 17, 2024 at 1:49 AM Sean Owen  wrote:
>>> > Yeah let's get that fix in, but it seems to be a minor test only issue
>>> so should not block release.
>>> >
>>> > On Fri, Feb 16, 2024, 9:30 AM yangjie01  wrote:
>>> > Very sorry. When I was fixing `SPARK-45242 (
>>> https://github.com/apache/spark/pull/43594)`
>>> <https://github.com/apache/spark/pull/43594)>, I noticed that its
>>> `Affects Version` and `Fix Version` of SPARK-45242 were both 4.0, and I
>>> didn't realize that it had also been merged into branch-3.5, so I didn't
>>> advocate for SPARK-45357 to be backported to branch-3.5.
>>> >  As far as I know, the condition to trigger this test failure is: when
>>> using Maven to test the `connect` module, if  `sparkTestRelation` in
>>> `SparkConnectProtoSuite` is not the first `DataFrame` to be initialized,
>>> then the `id` of `sparkTestRelation` will no longer be 0. So, I think this
>>> is indeed related to the order in which Maven executes the test cases in
>>> the `connect` module.
>>> >  I have submitted a backport PR to branch-3.5, and if necessary, we
>>> can merge it to fix this test issue.
>>> >  Jie Yang
>>> >   From: Jungtaek Lim 
>>> > Date: Friday, February 16, 2024, 22:15
>>> > To: Sean Owen , Rui Wang 
>>> > Cc: dev 
>>> > Subject: Re: [VOTE] Release Apache Spark 3.5.1 (RC2)
>>> >   I traced back relevant changes and got a sense of what happened.
>>> >   Yangjie figured out the issue via link. It's a tricky issue
>>> according to the comments from Yangjie - the test is dependent on ordering
>>> of execution for test suites. He said it does not fail in sbt, hence CI
>>> build couldn't catch it.
>>> > He fixed it via link, but we missed that the offending commit was also
>>> ported back to 3.5 as well, hence the fix wasn't ported back to 3.5.
>>> >   Surprisingly, I can't reproduce locally even with maven. In my
>>> attempt to reproduce, SparkConnectProtoSuite was executed at third,
>>> SparkConnectStreamingQueryCacheSuite, and ExecuteEventsManagerSuite, and
>>> then SparkConnectProtoSuite. Maybe very specific to the environment, not
>>> just maven? My env: MBP M1 pro chip, MacOS 14.3.1, Openjdk 17.0.9. I used
>>> build/mvn (Maven 3.8.8).
>>> >   I'm not 100% sure this is something we should fail the release as
>>> it's a test only and sounds very environment dependent, but I'll respect
>>> your call on vote.
>>> >   Btw, looks like Rui also made a relevant fix via link (not to fix
>>> the failing test but to fix other issues), but this also wasn't ported back
>>> to 3.5. @Rui Wang Do you think this is a regression issue and warrants a
>>> new RC?
>>> > On Fri, Feb 16, 2024 at 11:38 AM Sean Owen 
>>> wrote:
>>> > Is anyone seeing this Spark Connect test failure? then again, I have
>>> some weird issue with this env that always fails 1 or 2 tests that nobody
>>> else can replicate.
>>> >   - Test observe *** FAILED ***
>>> >   == FAIL: Plans do not match ===
>>> >   !CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS
>>

Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-19 Thread Jungtaek Lim
Thanks Sean, let's continue the process for this RC.

+1 (non-binding)

- downloaded all files from URL
- checked signature
- extracted all archives
- ran all tests from source files in source archive file, via running "sbt
clean test package" - Ubuntu 20.04.4 LTS, OpenJDK 17.0.9.

Also bumping dev@ to encourage participation - the timing doesn't look
great for US folks, so let's wait a few more days.


On Sat, Feb 17, 2024 at 1:49 AM Sean Owen  wrote:

> Yeah let's get that fix in, but it seems to be a minor test only issue so
> should not block release.
>
> On Fri, Feb 16, 2024, 9:30 AM yangjie01  wrote:
>
>> Very sorry. When I was fixing SPARK-45242 (
>> https://github.com/apache/spark/pull/43594), I noticed that the
>> `Affects Version` and `Fix Version` of SPARK-45242 were both 4.0, and I
>> didn't realize that it had also been merged into branch-3.5, so I didn't
>> advocate for SPARK-45357 to be backported to branch-3.5.
>>
>>
>>
>> As far as I know, the condition to trigger this test failure is: when
>> using Maven to test the `connect` module, if  `sparkTestRelation` in
>> `SparkConnectProtoSuite` is not the first `DataFrame` to be initialized,
>> then the `id` of `sparkTestRelation` will no longer be 0. So, I think this
>> is indeed related to the order in which Maven executes the test cases in
>> the `connect` module.
>>
>>
>>
>> I have submitted a backport PR
>> <https://github.com/apache/spark/pull/45141> to branch-3.5, and if
>> necessary, we can merge it to fix this test issue.
>>
>>
>>
>> Jie Yang
>>
>>
>>
>> *From:* Jungtaek Lim 
>> *Date:* Friday, February 16, 2024, 22:15
>> *To:* Sean Owen , Rui Wang 
>> *Cc:* dev 
>> *Subject:* Re: [VOTE] Release Apache Spark 3.5.1 (RC2)
>>
>>
>>
>> I traced back relevant changes and got a sense of what happened.
>>
>>
>>
>> Yangjie figured out the issue via link
>> <https://github.com/apache/spark/pull/43010#discussion_r1338737506>.
>> It's a tricky issue according to the comments from Yangjie - the test is
>> dependent on ordering of execution for test suites. He said it does not
>> fail in sbt, hence CI build couldn't catch it.
>>
>> He fixed it via link <https://github.com/apache/spark/pull/43155>,
>> but we missed that the offending commit was also ported back to 3.5 as
>> well, hence the fix wasn't ported back to 3.5.
>>
>>
>>
>> Surprisingly, I can't reproduce locally even with maven. In my attempt to
>> reproduce, SparkConnectProtoSuite was executed at
>> third, SparkConnectStreamingQueryCacheSuite, and ExecuteEventsManagerSuite,
>> and then SparkConnectProtoSuite. Maybe very specific to the environment,
>> not just maven? My env: MBP M1 pro chip, MacOS 14.3.1, Openjdk 17.0.9. I
>> used build/mvn (Maven 3.8.8).
>>
>>
>>
>> I'm not 100% sure this is something we should fail the release as it's a
>> test only and sounds very environment dependent, but I'll respect your call
>> on vote.
>>
>>
>>
>> Btw, looks like Rui also made a relevant fix via link
>> <https://github.com/apache/spark/pull/43594>
>>  (not
>> to fix the failing test but to fix other issues), but this also wasn't
>> ported back to 3.5. @Rui Wang  Do you think this
>> is a regression issue and warrants a new RC?
>>
>>
>>
>>
>>
>> On Fri, Feb 16, 2024 at 11:38 AM Sean Owen  wrote:
>>
>> Is anyone seeing this Spark Connect test failure? then again, I have some
>> weird issue with this env that always fails 1 or 2 tests that nobody else
>> can replicate.
>>
>>
>>
>> - Test observe *** FAILED ***
>>   == FAIL: Plans do not match ===
>>   !CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS
>> max_val#0, sum(id#0) AS sum(id)#0L], 0   CollectMetrics my_metric,
>> [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L],
>> 44
>>+- LocalRelation , [id#0, name#0]
>>   +- LocalRelation , [id#0, name#0]
>> (PlanTest.scala:179)
>>
>>
>>
>> On Thu, Feb 15, 2024 at 1:34 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>> DISCLAIMER: RC for Apache Spark 3.5.1 starts with RC2 as I lately figured
>> out doc generation issue after tagging RC1.

Re: [VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-16 Thread Jungtaek Lim
I traced back relevant changes and got a sense of what happened.

Yangjie figured out the issue via link
<https://github.com/apache/spark/pull/43010#discussion_r1338737506>. It's a
tricky issue according to Yangjie's comments - the test depends on the
execution order of the test suites. He said it does not fail under sbt,
hence the CI build couldn't catch it.
He fixed it via link <https://github.com/apache/spark/pull/43155>, but we
missed that the offending commit had also been ported back to 3.5, so the
fix wasn't ported back to 3.5.

Surprisingly, I can't reproduce it locally, even with Maven. In my attempt
to reproduce, SparkConnectProtoSuite was executed third:
SparkConnectStreamingQueryCacheSuite, then ExecuteEventsManagerSuite, and
then SparkConnectProtoSuite. Maybe it's very specific to the environment,
not just Maven? My env: MBP with M1 Pro chip, macOS 14.3.1, OpenJDK 17.0.9.
I used build/mvn (Maven 3.8.8).
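The order dependence described above can be illustrated with a tiny,
hypothetical sketch (the names are illustrative, not Spark's actual code):
a test that asserts on a process-global ID counter only passes when its
suite happens to allocate first.

```python
import itertools

# A process-global counter, like a plan/relation ID that increments on
# every allocation within the test JVM.
_ids = itertools.count()

class Relation:
    def __init__(self):
        self.id = next(_ids)

# If this suite's relation is the first one allocated, the assertion holds:
first = Relation()
assert first.id == 0

# But if another suite had allocated a relation earlier (as a different
# suite execution order can cause), the same "id == 0" check would fail:
Relation()            # allocation by some "earlier" suite
later = Relation()
assert later.id != 0  # the counter has moved on -> order-dependent test
```

This is why the failure surfaces only under one runner's suite ordering and
not another's.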

I'm not 100% sure this is something we should fail the release over, as
it's test-only and sounds very environment-dependent, but I'll respect your
call on the vote.

Btw, it looks like Rui also made a relevant fix via link
<https://github.com/apache/spark/pull/43594> (not to fix the failing test
but to fix other issues), and this also wasn't ported back to 3.5. @Rui Wang
 Do you think this is a regression issue that warrants
a new RC?


On Fri, Feb 16, 2024 at 11:38 AM Sean Owen  wrote:

> Is anyone seeing this Spark Connect test failure? then again, I have some
> weird issue with this env that always fails 1 or 2 tests that nobody else
> can replicate.
>
> - Test observe *** FAILED ***
>   == FAIL: Plans do not match ===
>   !CollectMetrics my_metric, [min(id#0) AS min_val#0, max(id#0) AS
> max_val#0, sum(id#0) AS sum(id)#0L], 0   CollectMetrics my_metric,
> [min(id#0) AS min_val#0, max(id#0) AS max_val#0, sum(id#0) AS sum(id)#0L],
> 44
>+- LocalRelation , [id#0, name#0]
>   +- LocalRelation , [id#0, name#0]
> (PlanTest.scala:179)
>
> On Thu, Feb 15, 2024 at 1:34 PM Jungtaek Lim 
> wrote:
>
>> DISCLAIMER: RC for Apache Spark 3.5.1 starts with RC2 as I lately figured
>> out doc generation issue after tagging RC1.
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 3.5.1.
>>
>> The vote is open until February 18th 9AM (PST) and passes if a majority
>> +1 PMC votes are cast, with
>> a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.5.1
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see https://spark.apache.org/
>>
>> The tag to be voted on is v3.5.1-rc2 (commit
>> fd86f85e181fc2dc0f50a096855acf83a6cc5d9c):
>> https://github.com/apache/spark/tree/v3.5.1-rc2
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.1-rc2-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1452/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.1-rc2-docs/
>>
>> The list of bug fixes going into 3.5.1 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12353495
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC via "pip install
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.1-rc2-bin/pyspark-3.5.1.tar.gz
>> "
>> and see if anything important breaks.
>> In the Java/Scala, you can add the staging repository to your projects
>> resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with a out of date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.5.1?
>> ===
>>
>> The current list of open tickets targeted at 3.5.1 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.5.1
>>
>> Committer

Re: Re: [DISCUSS] Release Spark 3.5.1?

2024-02-15 Thread Jungtaek Lim
UPDATE: The vote thread is up now.
https://lists.apache.org/thread/f28h0brncmkoyv5mtsqtxx38hx309c2j


On Tue, Feb 6, 2024 at 11:30 AM Jungtaek Lim 
wrote:

> Thanks all for the positive feedback! Will figure out time to go through
> the RC process. Stay tuned!
>
> On Mon, Feb 5, 2024 at 7:46 AM Gengliang Wang  wrote:
>
>> +1
>>
>> On Sun, Feb 4, 2024 at 1:57 PM Hussein Awala  wrote:
>>
>>> +1
>>>
>>> On Sun, Feb 4, 2024 at 10:13 PM John Zhuge  wrote:
>>>
>>>> +1
>>>>
>>>> John Zhuge
>>>>
>>>>
>>>> On Sun, Feb 4, 2024 at 11:23 AM Santosh Pingale
>>>>  wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Sun, Feb 4, 2024, 8:18 PM Xiao Li 
>>>>> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> On Sun, Feb 4, 2024 at 6:07 AM beliefer  wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> At 2024-02-04 15:26:13, "Dongjoon Hyun"  wrote:
>>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> On Sat, Feb 3, 2024 at 9:18 PM yangjie01 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> +1
>>>>>>>>
>>>>>>>> On 2024/2/4 13:13, "Kent Yao" mailto:y...@apache.org>>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> +1
>>>>>>>>
>>>>>>>>
>>>>>>>> Jungtaek Lim >>>>>>> kabhwan.opensou...@gmail.com>> 于2024年2月3日周六 21:14写道:
>>>>>>>> >
>>>>>>>> > Hi dev,
>>>>>>>> >
>>>>>>>> > looks like there are a huge number of commits being pushed to
>>>>>>>> branch-3.5 after 3.5.0 was released, 200+ commits.
>>>>>>>> >
>>>>>>>> > $ git log --oneline v3.5.0..HEAD | wc -l
>>>>>>>> > 202
>>>>>>>> >
>>>>>>>> > Also, there are 180 JIRA tickets containing 3.5.1 as fixed
>>>>>>>> version, and 10 resolved issues are either marked as blocker (even
>>>>>>>> correctness issues) or critical, which justifies the release.
>>>>>>>> > https://issues.apache.org/jira/projects/SPARK/versions/12353495 <
>>>>>>>> https://issues.apache.org/jira/projects/SPARK/versions/12353495>
>>>>>>>> >
>>>>>>>> > What do you think about releasing 3.5.1 with the current head of
>>>>>>>> branch-3.5? I'm happy to volunteer as the release manager.
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> > Jungtaek Lim (HeartSaVioR)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> -
>>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >>>>>>> dev-unsubscr...@spark.apache.org>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> -
>>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>>


Re: Heads-up: Update on Spark 3.5.1 RC

2024-02-15 Thread Jungtaek Lim
UPDATE: Now the vote thread is up for RC2.
https://lists.apache.org/thread/f28h0brncmkoyv5mtsqtxx38hx309c2j

On Wed, Feb 14, 2024 at 2:59 AM Dongjoon Hyun 
wrote:

> Thank you for the update, Jungtaek.
>
> Dongjoon.
>
> On Tue, Feb 13, 2024 at 7:29 AM Jungtaek Lim 
> wrote:
>
>> Hi,
>>
>> Just a heads-up, since I haven't given an update in the week since the
>> last update from the discussion thread.
>>
>> I've been following the automated release process and have encountered
>> several issues. I will likely file JIRA tickets and follow up with PRs.
>>
>> Issues I've found so far are 1) a Python library version issue in the
>> release Docker image, and 2) a doc build failure in PySpark ML for Spark
>> Connect. I'm deferring submitting fixes until I see the dry run succeed.
>>
>> Btw, I optimistically ran the process without a dry run, as GA had passed
>> (my bad), and the tag for RC1 was created before I saw the issues. I'll
>> likely need to start with RC2 after things are sorted out and the
>> necessary fixes land in branch-3.5.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>>


[VOTE] Release Apache Spark 3.5.1 (RC2)

2024-02-15 Thread Jungtaek Lim
DISCLAIMER: The RC for Apache Spark 3.5.1 starts with RC2, as I belatedly
found a doc generation issue after tagging RC1.

Please vote on releasing the following candidate as Apache Spark version
3.5.1.

The vote is open until February 18th 9AM (PST) and passes if a majority +1
PMC votes are cast, with
a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.5.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v3.5.1-rc2 (commit
fd86f85e181fc2dc0f50a096855acf83a6cc5d9c):
https://github.com/apache/spark/tree/v3.5.1-rc2

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.5.1-rc2-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1452/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.5.1-rc2-docs/

The list of bug fixes going into 3.5.1 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12353495

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC via "pip install
https://dist.apache.org/repos/dist/dev/spark/v3.5.1-rc2-bin/pyspark-3.5.1.tar.gz
"
and see if anything important breaks.
For Java/Scala, you can add the staging repository to your project's
resolvers and test with the RC (make sure to clean up the artifact cache
before/after so you don't end up building with an out-of-date RC going
forward).
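For sbt users, a hedged sketch of that resolver setup might look like the
following (the URL is the staging repository listed above; spark-sql is just
one illustrative artifact, not an exhaustive dependency list):

```scala
// build.sbt — sketch for testing against the 3.5.1 RC2 staging repository.
// The resolver URL is the orgapachespark-1452 staging repo listed above;
// "spark-sql" is one illustrative dependency.
resolvers += "Spark 3.5.1 RC2 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1452/"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.1"
```

Remember to clear the local artifact cache (e.g. the org.apache.spark
entries under ~/.ivy2 and ~/.m2) before and after testing.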

===
What should happen to JIRA tickets still targeting 3.5.1?
===

The current list of open tickets targeted at 3.5.1 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 3.5.1

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


Heads-up: Update on Spark 3.5.1 RC

2024-02-13 Thread Jungtaek Lim
Hi,

Just a heads-up, since I haven't given an update for a week after the last
update in the discussion thread.

I've been following the automated release process and have encountered
several issues. I'll likely file JIRA tickets and follow up with PRs.

The issues I've found so far are 1) a Python library version issue in the
release Docker image, and 2) a doc build failure in PySpark ML for Spark
Connect. I'm deferring submitting fixes until I see the dry run succeed.

Btw, I optimistically ran the process without a dry run since GA had passed
(my bad), and the RC1 tag was created before I saw the issues. I'll likely
need to start with RC2 after things are sorted out and the necessary fixes
have landed in branch-3.5.

Thanks,
Jungtaek Lim (HeartSaVioR)


Re: Enhanced Console Sink for Structured Streaming

2024-02-05 Thread Jungtaek Lim
Maybe we could keep the default as it is, and require explicitly turning on
verboseMode to enable the auxiliary information. I don't believe anyone
parses the output of the console sink (which is what would make this a
breaking change), but changing default behavior should be approached
conservatively. We can highlight the mode in the guide doc, which would be
good enough to publicize the improvement.

Other than that, the proposal looks good to me. Adding some more details
may be appropriate - e.g. what if there are multiple stateful operators,
what if there are 100 state rows in the state store, etc. One sketch of an
idea is to employ multiple verbosity levels: list all state store rows at
full verbosity, and otherwise maybe just the number of state store rows.
This is just one example of such details.
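As a concrete reference point, here is a minimal, self-contained sketch of
today's console sink fed by a rate source, with the proposed option shown as
a comment. Note that verboseMode is the option proposed in this thread and
does not exist in Spark; truncate is an existing console sink option.

```scala
import org.apache.spark.sql.SparkSession

// Minimal console-sink sketch. "verboseMode" (commented out) is the option
// PROPOSED in this thread and is hypothetical, not an existing Spark option.
object ConsoleSinkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("console-sink-sketch")
      .getOrCreate()

    val query = spark.readStream
      .format("rate")                   // built-in test source: timestamp, value
      .option("rowsPerSecond", "1")
      .load()
      .writeStream
      .format("console")
      .option("truncate", "false")      // existing console sink option
      // .option("verboseMode", "true") // proposed: also emit watermark/state detail
      .outputMode("append")
      .start()

    query.awaitTermination(10000)       // run for ~10 seconds, then exit
    spark.stop()
  }
}
```

This runs with Spark on the classpath and simply prints micro-batches; the
commented-out line marks where the proposed option would plug in.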

On Sun, Feb 4, 2024 at 3:22 AM Neil Ramaswamy
 wrote:

> Re: verbosity: yes, it will be more verbose. A config I was planning to
> implement was a default-on console sink option, verboseMode, that you can
> set to be off if you just want sink data. I don't think that introduces
> additional complexity, as the last point suggests. (And also, nobody should
> be using this for "high data throughput" scenarios or
> "performance-sensitive applications". It's a development sink.)
>
> I don't think that exposing these details increases the learning curve:
> these details are *essential *for understanding how Structured Streaming
> works. I'd actually argue that it makes the learning curve shallower: by
> showing the few variables that affect the behavior of their pipelines,
> they'll have the conceptual understanding to answer essential questions
> like "why aren't my results showing up?" or "why is my state size always
> increasing?"
>
> Also: for stateless pipelines, none of this event-time and state detail
> applies. We would just render sink data—no behavior change from today. That
> seems gentle enough to me: start with stateless pipelines and see
> the output rows, but when you advance to stateful pipelines, you need to
> deal with the two complexities (event-time and state) of stateful streaming.
>
> On Sat, Feb 3, 2024 at 3:08 AM Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>> As I understood, the proposal you mentioned suggests adding event-time
>> and state store metadata to the console sink to better highlight the
>> semantics of the Structured Streaming engine. While I agree this
>> enhancement can provide valuable insights into the engine's behavior
>> especially for newcomers, there are potential challenges that we need to be
>> aware of:
>>
>> - Including additional metadata in the console sink output can increase
>> the volume of information printed. This might result in a more verbose
>> console output, making it harder to distinguish the actual data from the
>> metadata, especially in scenarios with high data throughput.
>> - The added verbosity from the proposed metadata may affect the output's
>> readability, especially for users who are primarily interested in the
>> processed data and not the internal engine details.
>> - Users unfamiliar with the internal workings of Structured Streaming
>> might misinterpret the metadata as part of the actual data, leading to
>> confusion.
>> - The act of printing additional metadata to the console may introduce
>> some overhead, especially in scenarios where high-frequency updates occur.
>> While this overhead might be minimal, it is worth considering it in
>> performance-sensitive applications.
>> - While the proposal aims to make it easier for beginners to understand
>> concepts like watermarks, operator state, and output rows, it could
>> potentially increase the learning curve due to the introduction of
>> additional terminology and information.
>> - Users might benefit from the ability to selectively enable or disable
>> the display of certain metadata elements to tailor the console output to
>> their specific needs. However, this introduces additional complexity.
>>
>> As usual with these things, your mileage varies. Whilst the proposed
>> enhancements offer valuable insights into the behavior of Structured
>> Streaming, we ought to think about the potential downsides, particularly in
>> terms of increased verbosity, complexity, and the impact on user experience
>>
>> HTH
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sat, 3 Feb 2024 at 01:32, Neil 

Re: Re: [DISCUSS] Release Spark 3.5.1?

2024-02-05 Thread Jungtaek Lim
Thanks all for the positive feedback! Will figure out time to go through
the RC process. Stay tuned!

On Mon, Feb 5, 2024 at 7:46 AM Gengliang Wang  wrote:

> +1
>
> On Sun, Feb 4, 2024 at 1:57 PM Hussein Awala  wrote:
>
>> +1
>>
>> On Sun, Feb 4, 2024 at 10:13 PM John Zhuge  wrote:
>>
>>> +1
>>>
>>> John Zhuge
>>>
>>>
>>> On Sun, Feb 4, 2024 at 11:23 AM Santosh Pingale
>>>  wrote:
>>>
>>>> +1
>>>>
>>>> On Sun, Feb 4, 2024, 8:18 PM Xiao Li 
>>>> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Sun, Feb 4, 2024 at 6:07 AM beliefer  wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2024-02-04 15:26:13, "Dongjoon Hyun"  wrote:
>>>>>>
>>>>>> +1
>>>>>>
>>>>>> On Sat, Feb 3, 2024 at 9:18 PM yangjie01 
>>>>>> wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> On 2024/2/4 13:13, "Kent Yao"  wrote:
>>>>>>>
>>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>>
>>>>>>> Jungtaek Lim  wrote on Sat, Feb 3, 2024 at 21:14:
>>>>>>> >
>>>>>>> > Hi dev,
>>>>>>> >
>>>>>>> > looks like there are a huge number of commits being pushed to
>>>>>>> branch-3.5 after 3.5.0 was released, 200+ commits.
>>>>>>> >
>>>>>>> > $ git log --oneline v3.5.0..HEAD | wc -l
>>>>>>> > 202
>>>>>>> >
>>>>>>> > Also, there are 180 JIRA tickets containing 3.5.1 as fixed
>>>>>>> version, and 10 resolved issues are either marked as blocker (even
>>>>>>> correctness issues) or critical, which justifies the release.
>>>>>>> > https://issues.apache.org/jira/projects/SPARK/versions/12353495 <
>>>>>>> https://issues.apache.org/jira/projects/SPARK/versions/12353495>
>>>>>>> >
>>>>>>> > What do you think about releasing 3.5.1 with the current head of
>>>>>>> branch-3.5? I'm happy to volunteer as the release manager.
>>>>>>> >
>>>>>>> > Thanks,
>>>>>>> > Jungtaek Lim (HeartSaVioR)
>>>>>>>
>>>>>>>
>>>>>>> -
>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>
>>>>>
>>>>> --
>>>>>
>>>>>


[DISCUSS] Release Spark 3.5.1?

2024-02-03 Thread Jungtaek Lim
Hi dev,

looks like there are a huge number of commits being pushed to branch-3.5
after 3.5.0 was released, 200+ commits.

$ git log --oneline v3.5.0..HEAD | wc -l
202

Also, there are 180 JIRA tickets containing 3.5.1 as fixed version, and 10
resolved issues are either marked as blocker (even correctness issues) or
critical, which justifies the release.
https://issues.apache.org/jira/projects/SPARK/versions/12353495

What do you think about releasing 3.5.1 with the current head of
branch-3.5? I'm happy to volunteer as the release manager.

Thanks,
Jungtaek Lim (HeartSaVioR)


Re: Spark 3.5.1

2024-01-31 Thread Jungtaek Lim
Hi,

I agree it's time to release 3.5.1. 10 resolved issues are either marked as
blockers (including correctness issues) or critical, which justifies the
release.

I had been trying to find the time to take the first step, but had no luck
so far. I'll give it another try this week (it needs some time, as I'm not
familiar with the Spark project's release process), and will seek another
volunteer if I can't make any progress.

Thanks,
Jungtaek Lim (HeartSaVioR)

On Tue, Jan 30, 2024 at 7:15 PM Santosh Pingale
 wrote:

> Hey there
>
> Spark 3.5 branch has accumulated 199 commits with quite a few bug
> fixes related to correctness. Are there any plans for releasing 3.5.1?
>
> Kind regards
> Santosh
>


[VOTE][RESULT] SPIP: Structured Streaming - Arbitrary State API v2

2024-01-11 Thread Jungtaek Lim
The vote passes with 12 +1s (3 binding +1s).
Thanks to all who reviewed the SPIP doc and voted!

(* = binding)
+1:
- Jungtaek Lim
- Anish Shrigondekar
- Mich Talebzadeh
- Raghu Angadi
- 刘唯
- Shixiong Zhu (*)
- Bartosz Konieczny
- Praveen Gattu
- Burak Yavuz
- Bhuwan Sahni
- L. C. Hsieh (*)
- Wenchen Fan (*)

+0: None

-1: None

Thanks,
Jungtaek Lim (HeartSaVioR)


Re: [VOTE] SPIP: Structured Streaming - Arbitrary State API v2

2024-01-11 Thread Jungtaek Lim
Thanks all for participating! The vote passed. I'll send out the result to
a separate thread.

On Thu, Jan 11, 2024 at 10:37 PM Wenchen Fan  wrote:

> +1
>
> On Thu, Jan 11, 2024 at 9:32 AM L. C. Hsieh  wrote:
>
>> +1
>>
>> On Wed, Jan 10, 2024 at 9:06 AM Bhuwan Sahni
>>  wrote:
>>
>>> +1. This is a good addition.
>>>
>>> <http://www.databricks.com>
>>> *Bhuwan Sahni*
>>> Staff Software Engineer
>>>
>>> bhuwan.sa...@databricks.com
>>> 500 108th Ave. NE
>>> Bellevue, WA 98004
>>> USA
>>>
>>>
>>> On Wed, Jan 10, 2024 at 9:00 AM Burak Yavuz  wrote:
>>>
>>>> +1. Excited to see more stateful workloads with Structured Streaming!
>>>>
>>>>
>>>> Best,
>>>> Burak
>>>>
>>>> On Wed, Jan 10, 2024 at 8:21 AM Praveen Gattu
>>>>  wrote:
>>>>
>>>>> +1. This gives Structured Streaming a good solution for
>>>>> customers wanting to build stateful stream processing applications.
>>>>>
>>>>> On Wed, Jan 10, 2024 at 7:30 AM Bartosz Konieczny <
>>>>> bartkoniec...@gmail.com> wrote:
>>>>>
>>>>>> +1 :)
>>>>>>
>>>>>> On Wed, Jan 10, 2024 at 9:57 AM Shixiong Zhu 
>>>>>> wrote:
>>>>>>
>>>>>>> +1 (binding)
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Shixiong Zhu
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jan 9, 2024 at 6:47 PM 刘唯  wrote:
>>>>>>>
>>>>>>>> This is a good addition! +1
>>>>>>>>
>>>>>>>> Raghu Angadi  wrote on Tue, Jan 9, 2024 at 13:17:
>>>>>>>>
>>>>>>>>> +1. This is a major improvement to the state API.
>>>>>>>>>
>>>>>>>>> Raghu.
>>>>>>>>>
>>>>>>>>> On Tue, Jan 9, 2024 at 1:42 AM Mich Talebzadeh <
>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> +1 for me as well
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Mich Talebzadeh,
>>>>>>>>>> Dad | Technologist | Solutions Architect | Engineer
>>>>>>>>>> London
>>>>>>>>>> United Kingdom
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>view my Linkedin profile
>>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *Disclaimer:* Use it at your own risk. Any and all
>>>>>>>>>> responsibility for any loss, damage or destruction of data or any 
>>>>>>>>>> other
>>>>>>>>>> property which may arise from relying on this email's technical 
>>>>>>>>>> content is
>>>>>>>>>> explicitly disclaimed. The author will in no case be liable for any
>>>>>>>>>> monetary damages arising from such loss, damage or destruction.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, 9 Jan 2024 at 03:24, Anish Shrigondekar
>>>>>>>>>>  wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Jungtaek for creating the Vote thread.
>>>>>>>>>>>
>>>>>>>>>>> +1 (non-binding) from my side too.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Anish
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jan 9, 2024 at 6:09 AM Jungtaek Lim <
>>>>>>>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Starting with my +1 (non-binding). Thanks!
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jan 9, 2024 at 9:37 AM Jungtaek Lim <
>>>>>>>>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'd like to start the vote for SPIP: Structured Streaming -
>>>>>>>>>>>>> Arbitrary State API v2.
>>>>>>>>>>>>>
>>>>>>>>>>>>> References:
>>>>>>>>>>>>>
>>>>>>>>>>>>>- JIRA ticket
>>>>>>>>>>>>><https://issues.apache.org/jira/browse/SPARK-45939>
>>>>>>>>>>>>>- SPIP doc
>>>>>>>>>>>>>
>>>>>>>>>>>>> <https://docs.google.com/document/d/1QtC5qd4WQEia9kl1Qv74WE0TiXYy3x6zeTykygwPWig/edit?usp=sharing>
>>>>>>>>>>>>>- Discussion thread
>>>>>>>>>>>>>
>>>>>>>>>>>>> <https://lists.apache.org/thread/3jyjdgk1m5zyqfmrocnt6t415703nc8l>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please vote on the SPIP for the next 72 hours:
>>>>>>>>>>>>>
>>>>>>>>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>>>>>>>> [ ] +0
>>>>>>>>>>>>> [ ] -1: I don’t think this is a good idea because …
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Bartosz Konieczny
>>>>>> freelance data engineer
>>>>>> https://www.waitingforcode.com
>>>>>> https://github.com/bartosz25/
>>>>>> https://twitter.com/waitingforcode
>>>>>>
>>>>>>


Re: [DISCUSS] SPIP: Structured Streaming - Arbitrary State API v2

2024-01-09 Thread Jungtaek Lim
Friendly reminder: the VOTE thread is now live!
https://lists.apache.org/thread/16ryx828bwoth31hobknxnjfxjxj07mf
Votes made in this thread are not counted, so please make sure to vote in
the VOTE thread. Thanks!

On Tue, Jan 9, 2024 at 9:33 AM Jungtaek Lim 
wrote:

> Thanks everyone for the feedback!
>
> Given that we've received positive feedback without major concerns, I will
> initiate the vote thread soon. Please cast your vote in that thread as well.
>
> Thanks again!
>
> On Tue, Jan 9, 2024 at 7:44 AM Bhuwan Sahni
>  wrote:
>
>> +1 on the newer APIs. I believe these APIs provide a much more powerful
>> mechanism for the user to perform arbitrary state management in Structured
>> Streaming queries.
>>
>> Thanks
>> Bhuwan Sahni
>>
>> On Mon, Jan 8, 2024 at 10:07 AM L. C. Hsieh  wrote:
>>
>>> +1
>>>
>>> I left some comments in the SPIP doc and got replies quickly. The new
>>> API looks good and more comprehensive. I think it will help Spark
>>> Structured Streaming to be more useful in more complicated streaming
>>> use cases.
>>>
>>> On Fri, Jan 5, 2024 at 8:15 PM Burak Yavuz  wrote:
>>> >
>>> > I'm also a +1 on the newer APIs. We had a lot of learnings from using
>>> flatMapGroupsWithState and I believe that we can make the APIs a lot easier
>>> to use.
>>> >
>>> > On Wed, Nov 29, 2023 at 6:43 PM Anish Shrigondekar
>>>  wrote:
>>> >>
>>> >> Hi dev,
>>> >>
>>> >> Addressed the comments that Jungtaek had on the doc. Bumping the
>>> thread once again to see if other folks have any feedback on the proposal.
>>> >>
>>> >> Thanks,
>>> >> Anish
>>> >>
>>> >> On Mon, Nov 27, 2023 at 8:15 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>> >>>
>>> >>> Kindly bump for better reach after the long holiday. Please kindly
>>> review the proposal which opens the chance to address complex use cases of
>>> streaming. Thanks!
>>> >>>
>>> >>> On Thu, Nov 23, 2023 at 8:19 AM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>> >>>>
>>> >>>> Thanks Anish for proposing SPIP and initiating this thread! I
>>> believe this SPIP will help a bunch of complex use cases on streaming.
>>> >>>>
>>> >>>> dev@: We are coincidentally initiating this discussion in
>>> thanksgiving holidays. We understand people in the US may not have time to
>>> review the SPIP, and we plan to bump this thread in early next week. We are
>>> open for any feedback from non-US during the holiday. We can either address
>>> feedback altogether after the holiday (Anish is in the US) or I can answer
>>> if the feedback is more about the question. Thanks!
>>> >>>>
>>> >>>> On Thu, Nov 23, 2023 at 5:27 AM Anish Shrigondekar <
>>> anish.shrigonde...@databricks.com> wrote:
>>> >>>>>
>>> >>>>> Hi dev,
>>> >>>>>
>>> >>>>> I would like to start a discussion on "Structured Streaming -
>>> Arbitrary State API v2". This proposal aims to address a bunch of
>>> limitations we see today using mapGroupsWithState/flatMapGroupsWithState
>>> operator. The detailed set of limitations is described in the SPIP doc.
>>> >>>>>
>>> >>>>> We propose to support various features such as multiple state
>>> variables (flexible data modeling), composite types, enhanced timer
>>> functionality, support for chaining operators after new operator, handling
>>> initial state along with state data source, schema evolution, etc. This will
>>> allow users to write more powerful streaming state management logic,
>>> primarily used in operational use cases. Other built-in stateful operators
>>> could also benefit from such changes in the future.
>>> >>>>>
>>> >>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-45939
>>> >>>>> SPIP:
>>> https://docs.google.com/document/d/1QtC5qd4WQEia9kl1Qv74WE0TiXYy3x6zeTykygwPWig/edit?usp=sharing
>>> >>>>> Design Doc:
>>> https://docs.google.com/document/d/1QjZmNZ-fHBeeCYKninySDIoOEWfX6EmqXs2lK097u9o/edit?usp=sharing
>>> >>>>>
>>> >>>>> cc - @Jungtaek Lim  who has graciously agreed to be the shepherd
>>> for this project
>>> >>>>>
>>> >>>>> Looking forward to your feedback !
>>> >>>>>
>>> >>>>> Thanks,
>>> >>>>> Anish
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>> --
>> <http://www.databricks.com>
>> *Bhuwan Sahni*
>> Staff Software Engineer
>>
>> bhuwan.sa...@databricks.com
>> 500 108th Ave. NE
>> Bellevue, WA 98004
>> USA
>>
>


Re: [VOTE] SPIP: Structured Streaming - Arbitrary State API v2

2024-01-08 Thread Jungtaek Lim
Starting with my +1 (non-binding). Thanks!

On Tue, Jan 9, 2024 at 9:37 AM Jungtaek Lim 
wrote:

> Hi all,
>
> I'd like to start the vote for SPIP: Structured Streaming - Arbitrary
> State API v2.
>
> References:
>
>- JIRA ticket <https://issues.apache.org/jira/browse/SPARK-45939>
>- SPIP doc
>
> <https://docs.google.com/document/d/1QtC5qd4WQEia9kl1Qv74WE0TiXYy3x6zeTykygwPWig/edit?usp=sharing>
>- Discussion thread
><https://lists.apache.org/thread/3jyjdgk1m5zyqfmrocnt6t415703nc8l>
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Thanks!
> Jungtaek Lim (HeartSaVioR)
>


[VOTE] SPIP: Structured Streaming - Arbitrary State API v2

2024-01-08 Thread Jungtaek Lim
Hi all,

I'd like to start the vote for SPIP: Structured Streaming - Arbitrary State
API v2.

References:

   - JIRA ticket <https://issues.apache.org/jira/browse/SPARK-45939>
   - SPIP doc
   
<https://docs.google.com/document/d/1QtC5qd4WQEia9kl1Qv74WE0TiXYy3x6zeTykygwPWig/edit?usp=sharing>
   - Discussion thread
   <https://lists.apache.org/thread/3jyjdgk1m5zyqfmrocnt6t415703nc8l>

Please vote on the SPIP for the next 72 hours:

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …

Thanks!
Jungtaek Lim (HeartSaVioR)


Re: [DISCUSS] SPIP: Structured Streaming - Arbitrary State API v2

2024-01-08 Thread Jungtaek Lim
Thanks everyone for the feedback!

Given that we've received positive feedback without major concerns, I will
initiate the vote thread soon. Please cast your vote in that thread as well.

Thanks again!

On Tue, Jan 9, 2024 at 7:44 AM Bhuwan Sahni
 wrote:

> +1 on the newer APIs. I believe these APIs provide a much more powerful
> mechanism for the user to perform arbitrary state management in Structured
> Streaming queries.
>
> Thanks
> Bhuwan Sahni
>
> On Mon, Jan 8, 2024 at 10:07 AM L. C. Hsieh  wrote:
>
>> +1
>>
>> I left some comments in the SPIP doc and got replies quickly. The new
>> API looks good and more comprehensive. I think it will help Spark
>> Structured Streaming to be more useful in more complicated streaming
>> use cases.
>>
>> On Fri, Jan 5, 2024 at 8:15 PM Burak Yavuz  wrote:
>> >
>> > I'm also a +1 on the newer APIs. We had a lot of learnings from using
>> flatMapGroupsWithState and I believe that we can make the APIs a lot easier
>> to use.
>> >
>> > On Wed, Nov 29, 2023 at 6:43 PM Anish Shrigondekar
>>  wrote:
>> >>
>> >> Hi dev,
>> >>
>> >> Addressed the comments that Jungtaek had on the doc. Bumping the
>> thread once again to see if other folks have any feedback on the proposal.
>> >>
>> >> Thanks,
>> >> Anish
>> >>
>> >> On Mon, Nov 27, 2023 at 8:15 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>> >>>
>> >>> Kindly bump for better reach after the long holiday. Please kindly
>> review the proposal which opens the chance to address complex use cases of
>> streaming. Thanks!
>> >>>
>> >>> On Thu, Nov 23, 2023 at 8:19 AM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>> >>>>
>> >>>> Thanks Anish for proposing SPIP and initiating this thread! I
>> believe this SPIP will help a bunch of complex use cases on streaming.
>> >>>>
>> >>>> dev@: We are coincidentally initiating this discussion in
>> thanksgiving holidays. We understand people in the US may not have time to
>> review the SPIP, and we plan to bump this thread in early next week. We are
>> open for any feedback from non-US during the holiday. We can either address
>> feedback altogether after the holiday (Anish is in the US) or I can answer
>> if the feedback is more about the question. Thanks!
>> >>>>
>> >>>> On Thu, Nov 23, 2023 at 5:27 AM Anish Shrigondekar <
>> anish.shrigonde...@databricks.com> wrote:
>> >>>>>
>> >>>>> Hi dev,
>> >>>>>
>> >>>>> I would like to start a discussion on "Structured Streaming -
>> Arbitrary State API v2". This proposal aims to address a bunch of
>> limitations we see today using mapGroupsWithState/flatMapGroupsWithState
>> operator. The detailed set of limitations is described in the SPIP doc.
>> >>>>>
>> >>>>> We propose to support various features such as multiple state
>> variables (flexible data modeling), composite types, enhanced timer
>> functionality, support for chaining operators after new operator, handling
>> initial state along with state data source, schema evolution, etc. This will
>> allow users to write more powerful streaming state management logic,
>> primarily used in operational use cases. Other built-in stateful operators
>> could also benefit from such changes in the future.
>> >>>>>
>> >>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-45939
>> >>>>> SPIP:
>> https://docs.google.com/document/d/1QtC5qd4WQEia9kl1Qv74WE0TiXYy3x6zeTykygwPWig/edit?usp=sharing
>> >>>>> Design Doc:
>> https://docs.google.com/document/d/1QjZmNZ-fHBeeCYKninySDIoOEWfX6EmqXs2lK097u9o/edit?usp=sharing
>> >>>>>
>> >>>>> cc - @Jungtaek Lim  who has graciously agreed to be the shepherd
>> for this project
>> >>>>>
>> >>>>> Looking forward to your feedback !
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Anish
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> <http://www.databricks.com>
> *Bhuwan Sahni*
> Staff Software Engineer
>
> bhuwan.sa...@databricks.com
> 500 108th Ave. NE
> Bellevue, WA 98004
> USA
>
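For archive readers, the data-modeling gap described in the SPIP excerpt
quoted above can be illustrated without Spark: flatMapGroupsWithState forces
all per-group state into one object, while the v2 proposal allows multiple
independent state variables. The ValueState trait below is a toy in-memory
stand-in for illustration only, not the actual API proposed in the SPIP.

```scala
// Spark-free sketch of the data-modeling gap the quoted SPIP describes.

// v1 style (flatMapGroupsWithState): all per-group state lives in ONE object,
// so every new concern forces a rewrite of this single class.
final case class V1State(runningSum: Long, lastSeenMs: Long)

// v2 style (the proposal): multiple independent, named state variables per
// group. This trait and class are toy in-memory stand-ins for illustration,
// NOT the actual API proposed in the SPIP.
trait ValueState[T] {
  def get: Option[T]
  def update(value: T): Unit
  def clear(): Unit
}

final class InMemoryValueState[T] extends ValueState[T] {
  private var slot: Option[T] = None
  override def get: Option[T] = slot
  override def update(value: T): Unit = { slot = Some(value) }
  override def clear(): Unit = { slot = None }
}

object StateSketch {
  def main(args: Array[String]): Unit = {
    // Each concern gets its own variable; one can be added, evolved, or
    // dropped without touching the others.
    val runningSum = new InMemoryValueState[Long]
    val lastSeenMs = new InMemoryValueState[Long]

    runningSum.update(runningSum.get.getOrElse(0L) + 42L)
    lastSeenMs.update(System.currentTimeMillis())

    println(runningSum.get) // prints Some(42)
  }
}
```

Composite types, timers, and the other proposed features follow the same
idea: state becomes a set of named, independently managed variables rather
than one monolithic value.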


Re: Apache Spark 3.3.4 EOL Release?

2023-12-11 Thread Jungtaek Lim
Sorry for the late reply; I've been busy these days and haven't had time to
respond.

I didn't realize you were doing the release preparation and the discussion
in parallel. I totally agree you should proceed, since you've already taken
steps.

Also, thanks for the suggestion! Unfortunately I got busy after
volunteering, but I'll figure out how to make it happen, hopefully before
the end of this year.

Thanks,
Jungtaek Lim (HeartSaVioR)

On Sat, Dec 9, 2023 at 2:22 AM Dongjoon Hyun 
wrote:

> Thank you, Mridul, and Kent, too.
>
> Additionally, thank you for volunteering as a release manager, Jungtaek,
>
> For the 3.3.4 EOL release, I've already been testing and preparing for one
> week since my first email.
>
> So, why don't you proceed with the Apache Spark 3.5.1 release? It has 142
> patches already.
>
> $ git log --oneline v3.5.0..HEAD | wc -l
>  142
>
> I'd like to recommend you to proceed by sending an independent discussion
> email to the dev mailing list.
>
> I love to see Apache Spark 3.5.1 in December. too.
>
> BTW, as you mentioned, there is no strict timeline for 3.5.1, so take your
> time.
>
> Thanks,
> Dongjoon.
>
>
>
> On Fri, Dec 8, 2023 at 2:04 AM Jungtaek Lim 
> wrote:
>
>> +1 to release 3.3.4 and consider 3.3 as EOL.
>>
>> Btw, it'd probably be ideal to give people who haven't yet had a chance to
>> go through the release process the opportunity to experience it (when there
>> are people happy to take it). If you don't mind and we are not very strict
>> on the timeline, I'd be happy to volunteer and give it a try.
>>
>> On Tue, Dec 5, 2023 at 12:12 PM Kent Yao  wrote:
>>
>>> +1
>>>
>>> Thank you for driving this EOL release, Dongjoon!
>>>
>>> Kent Yao
>>>
>>> On 2023/12/04 19:40:10 Mridul Muralidharan wrote:
>>> > +1
>>> >
>>> > Regards,
>>> > Mridul
>>> >
>>> > On Mon, Dec 4, 2023 at 11:40 AM L. C. Hsieh  wrote:
>>> >
>>> > > +1
>>> > >
>>> > > Thanks Dongjoon!
>>> > >
>>> > > On Mon, Dec 4, 2023 at 9:26 AM Yang Jie 
>>> wrote:
>>> > > >
>>> > > > +1 for a 3.3.4 EOL Release. Thanks Dongjoon.
>>> > > >
>>> > > > Jie Yang
>>> > > >
>>> > > > On 2023/12/04 15:08:25 Tom Graves wrote:
>>> > > > >  +1 for a 3.3.4 EOL Release. Thanks Dongjoon.
>>> > > > > Tom
>>> > > > > On Friday, December 1, 2023 at 02:48:22 PM CST, Dongjoon
>>> Hyun <
>>> > > dongjoon.h...@gmail.com> wrote:
>>> > > > >
>>> > > > >  Hi, All.
>>> > > > >
>>> > > > > Since the Apache Spark 3.3.0 RC6 vote passed on Jun 14, 2022,
>>> > > branch-3.3 has been maintained and served well until now.
>>> > > > >
>>> > > > > - https://github.com/apache/spark/releases/tag/v3.3.0 (tagged
>>> on Jun
>>> > > 9th, 2022)
>>> > > > > -
>>> https://lists.apache.org/thread/zg6k1spw6k1c7brgo6t7qldvsqbmfytm
>>> > > (vote result on June 14th, 2022)
>>> > > > >
>>> > > > > As of today, branch-3.3 has 56 additional patches after v3.3.3
>>> (tagged
>>> > > on Aug 3rd about 4 month ago) and reaches the end-of-life this month
>>> > > according to the Apache Spark release cadence,
>>> > > https://spark.apache.org/versioning-policy.html .
>>> > > > >
>>> > > > > $ git log --oneline v3.3.3..HEAD | wc -l
>>> > > > > 56
>>> > > > >
>>> > > > > Along with the recent Apache Spark 3.4.2 release, I hope the
>>> users can
>>> > > get a chance to have these last bits of Apache Spark 3.3.x, and I'd
>>> like to
>>> > > propose to have Apache Spark 3.3.4 EOL Release vote on December 11th
>>> and
>>> > > volunteer as the release manager.
>>> > > > >
>>> > > > > WDYT?
>>> > > > >
>>> > > > > Please let us know if you need more patches on branch-3.3.
>>> > > > >
>>> > > > > Thanks,
>>> > > > > Dongjoon.
>>> > > > >
>>> > > >
>>> > > >
>>> -
>>> > > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> > > >
>>> > >
>>> > > -
>>> > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> > >
>>> > >
>>> >
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Re: Apache Spark 3.3.4 EOL Release?

2023-12-07 Thread Jungtaek Lim
+1 to release 3.3.4 and consider 3.3 as EOL.

Btw, it'd probably be ideal to give people who haven't yet had a chance to
go through the release process the opportunity to experience it (when there
are people happy to take it). If you don't mind and we are not very strict
on the timeline, I'd be happy to volunteer and give it a try.

On Tue, Dec 5, 2023 at 12:12 PM Kent Yao  wrote:

> +1
>
> Thank you for driving this EOL release, Dongjoon!
>
> Kent Yao
>
> On 2023/12/04 19:40:10 Mridul Muralidharan wrote:
> > +1
> >
> > Regards,
> > Mridul
> >
> > On Mon, Dec 4, 2023 at 11:40 AM L. C. Hsieh  wrote:
> >
> > > +1
> > >
> > > Thanks Dongjoon!
> > >
> > > On Mon, Dec 4, 2023 at 9:26 AM Yang Jie  wrote:
> > > >
> > > > +1 for a 3.3.4 EOL Release. Thanks Dongjoon.
> > > >
> > > > Jie Yang
> > > >
> > > > On 2023/12/04 15:08:25 Tom Graves wrote:
> > > > >  +1 for a 3.3.4 EOL Release. Thanks Dongjoon.
> > > > > Tom
> > > > > On Friday, December 1, 2023 at 02:48:22 PM CST, Dongjoon Hyun <
> > > dongjoon.h...@gmail.com> wrote:
> > > > >
> > > > >  Hi, All.
> > > > >
> > > > > Since the Apache Spark 3.3.0 RC6 vote passed on Jun 14, 2022,
> > > branch-3.3 has been maintained and served well until now.
> > > > >
> > > > > - https://github.com/apache/spark/releases/tag/v3.3.0 (tagged on
> Jun
> > > 9th, 2022)
> > > > > - https://lists.apache.org/thread/zg6k1spw6k1c7brgo6t7qldvsqbmfytm
> > > (vote result on June 14th, 2022)
> > > > >
> > > > > As of today, branch-3.3 has 56 additional patches after v3.3.3
> (tagged
> > > on Aug 3rd about 4 month ago) and reaches the end-of-life this month
> > > according to the Apache Spark release cadence,
> > > https://spark.apache.org/versioning-policy.html .
> > > > >
> > > > > $ git log --oneline v3.3.3..HEAD | wc -l
> > > > > 56
> > > > >
> > > > > Along with the recent Apache Spark 3.4.2 release, I hope the users
> can
> > > get a chance to have these last bits of Apache Spark 3.3.x, and I'd
> like to
> > > propose to have Apache Spark 3.3.4 EOL Release vote on December 11th
> and
> > > volunteer as the release manager.
> > > > >
> > > > > WDYT?
> > > > >
> > > > > Please let us know if you need more patches on branch-3.3.
> > > > >
> > > > > Thanks,
> > > > > Dongjoon.
> > > > >
> > > >
> > > > -
> > > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > > >
> > >
> > > -
> > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > >
> > >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS] SPIP: Structured Streaming - Arbitrary State API v2

2023-11-27 Thread Jungtaek Lim
Bumping for better reach after the long holiday. Please kindly review the
proposal, which opens up the chance to address complex streaming use cases.
Thanks!

On Thu, Nov 23, 2023 at 8:19 AM Jungtaek Lim 
wrote:

> Thanks Anish for proposing SPIP and initiating this thread! I believe this
> SPIP will help a bunch of complex use cases on streaming.
>
> dev@: We are coincidentally initiating this discussion during the
> Thanksgiving holidays. We understand people in the US may not have time to
> review the SPIP, and we plan to bump this thread early next week. We are
> open to any feedback from outside the US during the holiday. We can either
> address feedback altogether after the holiday (Anish is in the US), or I
> can answer directly if the feedback is more of a question. Thanks!
>
> On Thu, Nov 23, 2023 at 5:27 AM Anish Shrigondekar <
> anish.shrigonde...@databricks.com> wrote:
>
>> Hi dev,
>>
>> I would like to start a discussion on "Structured Streaming - Arbitrary
>> State API v2". This proposal aims to address a bunch of limitations we see
>> today using mapGroupsWithState/flatMapGroupsWithState operator. The
>> detailed set of limitations is described in the SPIP doc.
>>
>> We propose to support various features such as multiple state variables
>> (flexible data modeling), composite types, enhanced timer functionality,
> support for chaining operators after the new operator, handling initial
> state along with state data source, schema evolution, etc. This will allow
> users to
>> write more powerful streaming state management logic primarily used in
>> operational use-cases. Other built-in stateful operators could also benefit
>> from such changes in the future.
>>
>> JIRA: https://issues.apache.org/jira/browse/SPARK-45939
>> SPIP:
>> https://docs.google.com/document/d/1QtC5qd4WQEia9kl1Qv74WE0TiXYy3x6zeTykygwPWig/edit?usp=sharing
>> Design Doc:
>> https://docs.google.com/document/d/1QjZmNZ-fHBeeCYKninySDIoOEWfX6EmqXs2lK097u9o/edit?usp=sharing
>>
>> cc - @Jungtaek Lim   who has graciously
>> agreed to be the shepherd for this project
>>
>> Looking forward to your feedback !
>>
>> Thanks,
>> Anish
>>
>
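Since the SPIP and design docs are external, here is a hypothetical, self-contained sketch in plain Python of the shape such an API could take: multiple typed state variables per grouping key plus a timer, in contrast to the single state value of mapGroupsWithState/flatMapGroupsWithState. Every class and method name below (`StateHandle`, `get_value_state`, `register_timer`, ...) is an illustrative assumption, not the API proposed in the doc.

```python
# Hypothetical sketch only: all names are illustrative assumptions, not the
# API from the SPIP doc. It models "multiple state variables per key".

class _ValueState:
    def __init__(self):
        self._v = None
    def get(self):
        return self._v
    def update(self, v):
        self._v = v

class _ListState:
    def __init__(self):
        self._items = []
    def append_value(self, v):
        self._items.append(v)
    def get(self):
        return list(self._items)

class StateHandle:
    """Per-key handle exposing several independent, named state variables."""
    def __init__(self):
        self._states = {}
        self.timers = []
    def get_value_state(self, name):
        return self._states.setdefault(name, _ValueState())
    def get_list_state(self, name):
        return self._states.setdefault(name, _ListState())
    def register_timer(self, expiry_ms):
        self.timers.append(expiry_ms)

def handle_session_events(handle, event_times_ms, gap_ms=30 * 60 * 1000):
    """Flexible data modeling: a counter and an event list for the same key."""
    summary = handle.get_value_state("summary")
    events = handle.get_list_state("events")
    count, last_seen = summary.get() or (0, 0)
    for ts in event_times_ms:
        events.append_value(ts)
        count, last_seen = count + 1, max(last_seen, ts)
    summary.update((count, last_seen))
    handle.register_timer(last_seen + gap_ms)  # enhanced timer functionality
    return count, last_seen

handle = StateHandle()
assert handle_session_events(handle, [1_000, 5_000]) == (2, 5_000)
assert handle_session_events(handle, [9_000]) == (3, 9_000)
assert handle.timers[-1] == 9_000 + 30 * 60 * 1000
```

The point of the sketch is the data modeling: the counter and the event list evolve independently under the same key, which is exactly what a single opaque state value makes awkward today.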


Re: [DISCUSS] SPIP: Structured Streaming - Arbitrary State API v2

2023-11-22 Thread Jungtaek Lim
Thanks Anish for proposing SPIP and initiating this thread! I believe this
SPIP will help a bunch of complex use cases on streaming.

dev@: We are coincidentally initiating this discussion during the
Thanksgiving holidays. We understand people in the US may not have time to
review the SPIP, and we plan to bump this thread early next week. We are
open to any feedback from outside the US during the holiday. We can either
address feedback altogether after the holiday (Anish is in the US), or I
can answer directly if the feedback is more of a question. Thanks!

On Thu, Nov 23, 2023 at 5:27 AM Anish Shrigondekar <
anish.shrigonde...@databricks.com> wrote:

> Hi dev,
>
> I would like to start a discussion on "Structured Streaming - Arbitrary
> State API v2". This proposal aims to address a bunch of limitations we see
> today using mapGroupsWithState/flatMapGroupsWithState operator. The
> detailed set of limitations is described in the SPIP doc.
>
> We propose to support various features such as multiple state variables
> (flexible data modeling), composite types, enhanced timer functionality,
> support for chaining operators after the new operator, handling initial
> state along with state data source, schema evolution, etc. This will allow
> users to
> write more powerful streaming state management logic primarily used in
> operational use-cases. Other built-in stateful operators could also benefit
> from such changes in the future.
>
> JIRA: https://issues.apache.org/jira/browse/SPARK-45939
> SPIP:
> https://docs.google.com/document/d/1QtC5qd4WQEia9kl1Qv74WE0TiXYy3x6zeTykygwPWig/edit?usp=sharing
> Design Doc:
> https://docs.google.com/document/d/1QjZmNZ-fHBeeCYKninySDIoOEWfX6EmqXs2lK097u9o/edit?usp=sharing
>
> cc - @Jungtaek Lim   who has graciously
> agreed to be the shepherd for this project
>
> Looking forward to your feedback !
>
> Thanks,
> Anish
>


Re: [VOTE] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-15 Thread Jungtaek Lim
+1 (non-binding)

On Thu, Nov 16, 2023 at 4:23 AM Ilan Filonenko  wrote:

> +1 (non-binding)
>
> On Wed, Nov 15, 2023 at 12:57 PM Xiao Li  wrote:
>
>> +1
>>
>> bo yang  于2023年11月15日周三 05:55写道:
>>
>>> +1
>>>
>>> On Tue, Nov 14, 2023 at 7:18 PM huaxin gao 
>>> wrote:
>>>
 +1

 On Tue, Nov 14, 2023 at 10:45 AM Holden Karau 
 wrote:

> +1
>
> On Tue, Nov 14, 2023 at 10:21 AM DB Tsai  wrote:
>
>> +1
>>
>> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>>
>> On Nov 14, 2023, at 10:14 AM, Vakaris Baškirov <
>> vakaris.bashki...@gmail.com> wrote:
>>
>> +1 (non-binding)
>>
>>
>> On Tue, Nov 14, 2023 at 8:03 PM Chao Sun  wrote:
>>
>>> +1
>>>
>>> On Tue, Nov 14, 2023 at 9:52 AM L. C. Hsieh 
>>> wrote:
>>> >
>>> > +1
>>> >
>>> > On Tue, Nov 14, 2023 at 9:46 AM Ye Zhou 
>>> wrote:
>>> > >
>>> > > +1(Non-binding)
>>> > >
>>> > > On Tue, Nov 14, 2023 at 9:42 AM L. C. Hsieh 
>>> wrote:
>>> > >>
>>> > >> Hi all,
>>> > >>
>>> > >> I’d like to start a vote for SPIP: An Official Kubernetes
>>> Operator for
>>> > >> Apache Spark.
>>> > >>
>>> > >> The proposal is to develop an official Java-based Kubernetes
>>> operator
>>> > >> for Apache Spark to automate the deployment and simplify the
>>> lifecycle
>>> > >> management and orchestration of Spark applications and Spark
>>> clusters
>>> > >> on k8s at prod scale.
>>> > >>
>>> > >> This aims to reduce the learning curve and operation overhead
>>> for
>>> > >> Spark users so they can concentrate on core Spark logic.
>>> > >>
>>> > >> Please also refer to:
>>> > >>
>>> > >>- Discussion thread:
>>> > >>
>>> https://lists.apache.org/thread/wdy7jfhf7m8jy74p6s0npjfd15ym5rxz
>>> > >>- JIRA ticket:
>>> https://issues.apache.org/jira/browse/SPARK-45923
>>> > >>- SPIP doc:
>>> https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE
>>> > >>
>>> > >>
>>> > >> Please vote on the SPIP for the next 72 hours:
>>> > >>
>>> > >> [ ] +1: Accept the proposal as an official SPIP
>>> > >> [ ] +0
>>> > >> [ ] -1: I don’t think this is a good idea because …
>>> > >>
>>> > >>
>>> > >> Thank you!
>>> > >>
>>> > >> Liang-Chi Hsieh
>>> > >>
>>> > >>
>>> -
>>> > >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> > >>
>>> > >
>>> > >
>>> > > --
>>> > >
>>> > > Zhou, Ye  周晔
>>> >
>>> >
>>> -
>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>


[VOTE][RESULT] SPIP: State Data Source - Reader

2023-10-25 Thread Jungtaek Lim
The vote passes with 9 +1s (4 binding +1s).
Thanks to all who reviewed the SPIP doc and voted!

(* = binding)
+1:
- Jungtaek Lim
- Wenchen Fan (*)
- Anish Shrigondekar
- L. C. Hsieh (*)
- Jia Fan
- Bartosz Konieczny
- Yuanjian Li (*)
- Shixiong Zhu (*)
- Yuepeng Pan

+0: None

-1: None

Thanks,
Jungtaek Lim (HeartSaVioR)


Re: [VOTE] SPIP: State Data Source - Reader

2023-10-25 Thread Jungtaek Lim
Thanks all for participating! The vote passed. I'll send out the result in
a separate thread.

On Thu, Oct 26, 2023 at 10:52 AM Yuepeng Pan  wrote:

> +1 (non-binding)
>
> Regards,
> Roc
>
> At 2023-10-23 12:23:52, "Jungtaek Lim" 
> wrote:
>
> Hi all,
>
> I'd like to start the vote for SPIP: State Data Source - Reader.
>
> The high-level summary of the SPIP is that we propose a new data source
> which enables reading the state store in the checkpoint via a batch query.
> This would enable two major use cases: 1) constructing tests that verify
> the state store, and 2) inspecting values in the state store when
> investigating an incident.
>
> References:
>
>- JIRA ticket <https://issues.apache.org/jira/browse/SPARK-45511>
>- SPIP doc
>
> <https://docs.google.com/document/d/1_iVf_CIu2RZd3yWWF6KoRNlBiz5NbSIK0yThqG0EvPY/edit?usp=sharing>
>- Discussion thread
><https://lists.apache.org/thread/l16cjqrpfbrlb8svhdw3qlfkh9pnlkcc>
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Thanks!
> Jungtaek Lim (HeartSaVioR)
>
>


Re: [VOTE] SPIP: State Data Source - Reader

2023-10-24 Thread Jungtaek Lim
Friendly reminder: the VOTE thread got 2 binding votes and needs 1 more
binding vote to pass.

On Wed, Oct 25, 2023 at 1:21 AM Bartosz Konieczny 
wrote:

> +1
>
> On Tuesday, October 24, 2023, Jia Fan  wrote:
>
>> +1
>>
>> L. C. Hsieh  于2023年10月24日周二 13:23写道:
>>
>>> +1
>>>
>>> On Mon, Oct 23, 2023 at 6:31 PM Anish Shrigondekar
>>>  wrote:
>>> >
>>> > +1 (non-binding)
>>> >
>>> > Thanks,
>>> > Anish
>>> >
>>> > On Mon, Oct 23, 2023 at 5:01 PM Wenchen Fan 
>>> wrote:
>>> >>
>>> >> +1
>>> >>
>>> >> On Mon, Oct 23, 2023 at 4:03 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>> >>>
>>> >>> Starting with my +1 (non-binding). Thanks!
>>> >>>
>>> >>> On Mon, Oct 23, 2023 at 1:23 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>> >>>>
>>> >>>> Hi all,
>>> >>>>
>>> >>>> I'd like to start the vote for SPIP: State Data Source - Reader.
>>> >>>>
>>> >>>> The high level summary of the SPIP is that we propose a new data
>>> source which enables a read ability for state store in the checkpoint, via
>>> batch query. This would enable two major use cases 1) constructing tests
>>> with verifying state store 2) inspecting values in state store in the
>>> scenario of incident.
>>> >>>>
>>> >>>> References:
>>> >>>>
>>> >>>> JIRA ticket
>>> >>>> SPIP doc
>>> >>>> Discussion thread
>>> >>>>
>>> >>>> Please vote on the SPIP for the next 72 hours:
>>> >>>>
>>> >>>> [ ] +1: Accept the proposal as an official SPIP
>>> >>>> [ ] +0
>>> >>>> [ ] -1: I don’t think this is a good idea because …
>>> >>>>
>>> >>>> Thanks!
>>> >>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>
> --
> Bartosz Konieczny
> freelance data engineer
> https://www.waitingforcode.com
> https://github.com/bartosz25/
> https://twitter.com/waitingforcode
>
>
>


Re: [DISCUSS] SPIP: State Data Source - Reader

2023-10-23 Thread Jungtaek Lim
FYI: the VOTE thread is open. Please check the link
https://lists.apache.org/thread/7ohctj1gmqbhds56bntf4s2zst5qpll1
(committers+ can log in to reply) or search for "[VOTE] SPIP: State Data
Source - Reader" in your inbox. Every vote would be really appreciated!

On Mon, Oct 23, 2023 at 1:06 PM Jungtaek Lim 
wrote:

> I don't see major comments as of now. Given that the thread was initiated
> more than 10 days ago and I see multiple supporters, I'm going to initiate
> a VOTE thread.
>
> Please participate in the VOTE thread as well. Thanks!
>
> On Thu, Oct 19, 2023 at 11:39 AM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> Also, I want to replicate the comment Liang-Chi put into the SPIP doc, as
>> it is a rather general and usual question for every new addition of a
>> data source. Hence I want to sort it out for everyone.
>>
>> As I know, the author implemented a third-party tool for querying the
>>> state store as a data source a long time ago. I've suggested that some
>>> users use the tool before. It is a useful tool for special cases because
>>> there is no other tool/feature for the purpose.
>>> I think for such an effort to add a new data source, one usual question
>>> is why it has to be in the Spark repo instead of being a third-party
>>> tool, especially since this is not a frequently used one. Even for
>>> Structured Streaming users, it is only in rare cases necessary to look
>>> into state store content.
>>
>>
>> I think we do not expect the data source to be used rarely. We see two
>> different major use cases; 1) unit tests against a stateful query, and 2)
>> looking into the state during an incident to get full context. 2) is
>> probably not something users encounter frequently, hence it is valid to
>> say the new feature may not be used frequently. But 1) is definitely tied
>> to daily work.
>>
>> Also, even for 2), it looks to be an essential feature that has to be
>> provided out-of-the-box. Let's say this feature does not exist and a user
>> encounters an incident in production with a stateful query. During RCA,
>> they realize that the state is a black box and their only option is
>> deducing the value of the state indirectly, most likely requiring them to
>> modify the query heavily and feed in artificial inputs. If I were such a
>> user, I would consider this gap a fundamental issue of SS. The equivalent
>> has been out-of-the-box in Flink for years (the State Processor API), so
>> it also makes sense from a competitive standpoint.
>>
>> We see this effort as a stepping stone. As comments in the SPIP doc and
>> previous replies show, people also see the proposal as prior work for the
>> writer part, which would give us a chance to break the strong assumption
>> of a fixed number of shuffle partitions. I'd argue that this is a rather
>> fundamental limitation of SS, and I have seen many complaints about it. I
>> don't feel it is right to delegate solving a fundamental issue to a third
>> party. This is probably stronger evidence than the reader part.
>>
>> Here's another aspect: during the work, we observed missing pieces in
>> checkpointing, e.g. the information for prefix scan does not exist in the
>> checkpoint, which makes a big difference when restoring the state from
>> the state file. When we come to state repartitioning, the repartitioning
>> is based on the grouping keys in the operator (not the state key), hence
>> we will also need additional information for that. If this feature lives
>> in a third party, it will be very painful to make changes on both sides
>> together. It also brings up another headache: a versioning and
>> compatibility matrix.
>>
>> I hope this helps persuade people to add this to the Spark repo rather
>> than leave it to live its own life.
>>
>>
>> On Thu, Oct 19, 2023 at 11:08 AM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Thanks Raghu for your support!
>>>
>>> Btw, I'd like to replicate the support from JIRA ticket itself, I see
>>> support from Chaoqin and Praveen. Thanks both!
>>>
>>>
>>>
>>> On Thu, Oct 19, 2023 at 5:56 AM Raghu Angadi <
>>> raghu.ang...@databricks.com> wrote:
>>>
>>>> +1 overall and a big +1 to keeping offline state-rebalancing as a
>>>> primary use case.
>>>>
>>>> Raghu.
>>>>
>>>> On Mon, Oct 16, 2023 at 11:25 AM Bartosz Konieczny <
>>>> bartkoniec...@gmail.com> wrote:
>>>>
>>>>> Thank you, Jungtaek, for your answers! It's clear now.

Re: [VOTE] SPIP: State Data Source - Reader

2023-10-22 Thread Jungtaek Lim
Starting with my +1 (non-binding). Thanks!

On Mon, Oct 23, 2023 at 1:23 PM Jungtaek Lim 
wrote:

> Hi all,
>
> I'd like to start the vote for SPIP: State Data Source - Reader.
>
> The high-level summary of the SPIP is that we propose a new data source
> which enables reading the state store in the checkpoint via a batch query.
> This would enable two major use cases: 1) constructing tests that verify
> the state store, and 2) inspecting values in the state store when
> investigating an incident.
>
> References:
>
>- JIRA ticket <https://issues.apache.org/jira/browse/SPARK-45511>
>- SPIP doc
>
> <https://docs.google.com/document/d/1_iVf_CIu2RZd3yWWF6KoRNlBiz5NbSIK0yThqG0EvPY/edit?usp=sharing>
>- Discussion thread
><https://lists.apache.org/thread/l16cjqrpfbrlb8svhdw3qlfkh9pnlkcc>
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Thanks!
> Jungtaek Lim (HeartSaVioR)
>


[VOTE] SPIP: State Data Source - Reader

2023-10-22 Thread Jungtaek Lim
Hi all,

I'd like to start the vote for SPIP: State Data Source - Reader.

The high-level summary of the SPIP is that we propose a new data source
which enables reading the state store in the checkpoint via a batch query.
This would enable two major use cases: 1) constructing tests that verify
the state store, and 2) inspecting values in the state store when
investigating an incident.

References:

   - JIRA ticket <https://issues.apache.org/jira/browse/SPARK-45511>
   - SPIP doc
   
<https://docs.google.com/document/d/1_iVf_CIu2RZd3yWWF6KoRNlBiz5NbSIK0yThqG0EvPY/edit?usp=sharing>
   - Discussion thread
   <https://lists.apache.org/thread/l16cjqrpfbrlb8svhdw3qlfkh9pnlkcc>

Please vote on the SPIP for the next 72 hours:

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …

Thanks!
Jungtaek Lim (HeartSaVioR)
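To make the proposed reader's two use cases concrete, below is a tiny, self-contained toy model in plain Python (no Spark required). The real proposal reads state store files from a streaming query's checkpoint; here a dict stands in for the checkpointed state, and the names `read_state_store` and `operator_id` are illustrative assumptions rather than the API defined in the SPIP doc.

```python
# Toy model only: a dict keyed by (operator_id, partition_id, key) stands in
# for the checkpointed state store; the real data source would read state
# files from the checkpoint location. All names here are assumptions.

def read_state_store(checkpoint, operator_id=0):
    """Expose checkpointed state as flat rows, the way a batch read would."""
    rows = []
    for (op_id, partition_id, key), value in sorted(checkpoint.items()):
        if op_id == operator_id:
            rows.append({"partition_id": partition_id, "key": key, "value": value})
    return rows

# Use case 1 from the summary: a unit test asserting on a stateful query's
# state contents after running the query against known input.
checkpoint = {
    (0, 0, "user-a"): {"count": 3},
    (0, 1, "user-b"): {"count": 5},
    (1, 0, "user-a"): {"sum": 42},  # state owned by a different operator
}
rows = read_state_store(checkpoint, operator_id=0)
assert [r["key"] for r in rows] == ["user-a", "user-b"]
assert rows[1]["value"]["count"] == 5
```

Use case 2 (incident inspection) is the same read path: point the reader at the production checkpoint and filter and inspect the returned rows instead of asserting on them.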


Re: [DISCUSS] SPIP: State Data Source - Reader

2023-10-22 Thread Jungtaek Lim
I don't see major comments as of now. Given that the thread was initiated
more than 10 days ago and I see multiple supporters, I'm going to initiate
a VOTE thread.

Please participate in the VOTE thread as well. Thanks!

On Thu, Oct 19, 2023 at 11:39 AM Jungtaek Lim 
wrote:

> Also, I want to replicate the comment Liang-Chi put into the SPIP doc, as
> it is a rather general and usual question for every new addition of a data
> source. Hence I want to sort it out for everyone.
>
> As I know, the author implemented a third-party tool for querying the state
>> store as a data source a long time ago. I've suggested that some users use
>> the tool before. It is a useful tool for special cases because there is no
>> other tool/feature for the purpose.
>> I think for such an effort to add a new data source, one usual question is
>> why it has to be in the Spark repo instead of being a third-party tool,
>> especially since this is not a frequently used one. Even for Structured
>> Streaming users, it is only in rare cases necessary to look into state
>> store content.
>
>
> I think we do not expect the data source to be used rarely. We see two
> different major use cases; 1) unit tests against a stateful query, and 2)
> looking into the state during an incident to get full context. 2) is
> probably not something users encounter frequently, hence it is valid to
> say the new feature may not be used frequently. But 1) is definitely tied
> to daily work.
>
> Also, even for 2), it looks to be an essential feature that has to be
> provided out-of-the-box. Let's say this feature does not exist and a user
> encounters an incident in production with a stateful query. During RCA,
> they realize that the state is a black box and their only option is
> deducing the value of the state indirectly, most likely requiring them to
> modify the query heavily and feed in artificial inputs. If I were such a
> user, I would consider this gap a fundamental issue of SS. The equivalent
> has been out-of-the-box in Flink for years (the State Processor API), so
> it also makes sense from a competitive standpoint.
>
> We see this effort as a stepping stone. As comments in the SPIP doc and
> previous replies show, people also see the proposal as prior work for the
> writer part, which would give us a chance to break the strong assumption
> of a fixed number of shuffle partitions. I'd argue that this is a rather
> fundamental limitation of SS, and I have seen many complaints about it. I
> don't feel it is right to delegate solving a fundamental issue to a third
> party. This is probably stronger evidence than the reader part.
>
> Here's another aspect: during the work, we observed missing pieces in
> checkpointing, e.g. the information for prefix scan does not exist in the
> checkpoint, which makes a big difference when restoring the state from the
> state file. When we come to state repartitioning, the repartitioning is
> based on the grouping keys in the operator (not the state key), hence we
> will also need additional information for that. If this feature lives in a
> third party, it will be very painful to make changes on both sides
> together. It also brings up another headache: a versioning and
> compatibility matrix.
>
> I hope this helps persuade people to add this to the Spark repo rather
> than leave it to live its own life.
>
>
> On Thu, Oct 19, 2023 at 11:08 AM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> Thanks Raghu for your support!
>>
>> Btw, I'd like to replicate the support from JIRA ticket itself, I see
>> support from Chaoqin and Praveen. Thanks both!
>>
>>
>>
>> On Thu, Oct 19, 2023 at 5:56 AM Raghu Angadi 
>> wrote:
>>
>>> +1 overall and a big +1 to keeping offline state-rebalancing as a
>>> primary use case.
>>>
>>> Raghu.
>>>
>>> On Mon, Oct 16, 2023 at 11:25 AM Bartosz Konieczny <
>>> bartkoniec...@gmail.com> wrote:
>>>
>>>> Thank you, Jungtaek, for your answers! It's clear now.
>>>>
>>>> +1 for me. It seems like a prerequisite for further ops-related
>>>> improvements for the state store management. I mean especially here the
>>>> state rebalancing that could rely on this read+write state store API. I
>>>> don't mean here the dynamic state rebalancing that could probably be
>>>> implemented with a lower latency directly in the stateful API. Instead I'm
>>>> thinking more of an offline job to rebalance the state and later restart
>>>> the stateful pipeline with the changed number of shuffle partitions.
>>>>
>>>> Best,
>>>> Bartosz.
>>>>

Re: [DISCUSS] SPIP: State Data Source - Reader

2023-10-18 Thread Jungtaek Lim
Also, I want to replicate the comment Liang-Chi put into the SPIP doc, as it
is a rather general and usual question for every new addition of a data
source. Hence I want to sort it out for everyone.

As I know, the author implemented a third-party tool for querying the state
> store as a data source a long time ago. I've suggested that some users use
> the tool before. It is a useful tool for special cases because there is no
> other tool/feature for the purpose.
> I think for such an effort to add a new data source, one usual question is
> why it has to be in the Spark repo instead of being a third-party tool,
> especially since this is not a frequently used one. Even for Structured
> Streaming users, it is only in rare cases necessary to look into state
> store content.


I think we do not expect the data source to be used rarely. We see two
different major use cases; 1) unit tests against a stateful query, and 2)
looking into the state during an incident to get full context. 2) is
probably not something users encounter frequently, hence it is valid to say
the new feature may not be used frequently. But 1) is definitely tied to
daily work.

Also, even for 2), it looks to be an essential feature that has to be
provided out-of-the-box. Let's say this feature does not exist and a user
encounters an incident in production with a stateful query. During RCA,
they realize that the state is a black box and their only option is
deducing the value of the state indirectly, most likely requiring them to
modify the query heavily and feed in artificial inputs. If I were such a
user, I would consider this gap a fundamental issue of SS. The equivalent
has been out-of-the-box in Flink for years (the State Processor API), so it
also makes sense from a competitive standpoint.

We see this effort as a stepping stone. As comments in the SPIP doc and
previous replies show, people also see the proposal as prior work for the
writer part, which would give us a chance to break the strong assumption of
a fixed number of shuffle partitions. I'd argue that this is a rather
fundamental limitation of SS, and I have seen many complaints about it. I
don't feel it is right to delegate solving a fundamental issue to a third
party. This is probably stronger evidence than the reader part.

Here's another aspect: during the work, we observed missing pieces in
checkpointing, e.g. the information for prefix scan does not exist in the
checkpoint, which makes a big difference when restoring the state from the
state file. When we come to state repartitioning, the repartitioning is
based on the grouping keys in the operator (not the state key), hence we
will also need additional information for that. If this feature lives in a
third party, it will be very painful to make changes on both sides
together. It also brings up another headache: a versioning and
compatibility matrix.

I hope this helps persuade people to add this to the Spark repo rather than
leave it to live its own life.


On Thu, Oct 19, 2023 at 11:08 AM Jungtaek Lim 
wrote:

> Thanks Raghu for your support!
>
> Btw, I'd like to replicate the support from JIRA ticket itself, I see
> support from Chaoqin and Praveen. Thanks both!
>
>
>
> On Thu, Oct 19, 2023 at 5:56 AM Raghu Angadi 
> wrote:
>
>> +1 overall and a big +1 to keeping offline state-rebalancing as a primary
>> use case.
>>
>> Raghu.
>>
>> On Mon, Oct 16, 2023 at 11:25 AM Bartosz Konieczny <
>> bartkoniec...@gmail.com> wrote:
>>
>>> Thank you, Jungtaek, for your answers! It's clear now.
>>>
>>> +1 for me. It seems like a prerequisite for further ops-related
>>> improvements for the state store management. I mean especially here the
>>> state rebalancing that could rely on this read+write state store API. I
>>> don't mean here the dynamic state rebalancing that could probably be
>>> implemented with a lower latency directly in the stateful API. Instead I'm
>>> thinking more of an offline job to rebalance the state and later restart
>>> the stateful pipeline with the changed number of shuffle partitions.
>>>
>>> Best,
>>> Bartosz.
>>>
>>> On Mon, Oct 16, 2023 at 6:19 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> bump for better reach
>>>>
>>>> On Thu, Oct 12, 2023 at 4:26 PM Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> Sorry, please use this link instead for SPIP doc:
>>>>> https://docs.google.com/document/d/1_iVf_CIu2RZd3yWWF6KoRNlBiz5NbSIK0yThqG0EvPY/edit?usp=sharing
>>>>>
>>>>>
>>>>> On Thu, Oct 12, 2023 at 3:58 PM Jungtaek Lim <
>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>

Re: [DISCUSS] SPIP: State Data Source - Reader

2023-10-18 Thread Jungtaek Lim
Thanks Raghu for your support!

Btw, I'd like to replicate the support from JIRA ticket itself, I see
support from Chaoqin and Praveen. Thanks both!



On Thu, Oct 19, 2023 at 5:56 AM Raghu Angadi 
wrote:

> +1 overall and a big +1 to keeping offline state-rebalancing as a primary
> use case.
>
> Raghu.
>
> On Mon, Oct 16, 2023 at 11:25 AM Bartosz Konieczny <
> bartkoniec...@gmail.com> wrote:
>
>> Thank you, Jungtaek, for your answers! It's clear now.
>>
>> +1 for me. It seems like a prerequisite for further ops-related
>> improvements for the state store management. I mean especially here the
>> state rebalancing that could rely on this read+write state store API. I
>> don't mean here the dynamic state rebalancing that could probably be
>> implemented with a lower latency directly in the stateful API. Instead I'm
>> thinking more of an offline job to rebalance the state and later restart
>> the stateful pipeline with the changed number of shuffle partitions.
>>
>> Best,
>> Bartosz.
>>
>> On Mon, Oct 16, 2023 at 6:19 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> bump for better reach
>>>
>>> On Thu, Oct 12, 2023 at 4:26 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> Sorry, please use this link instead for SPIP doc:
>>>> https://docs.google.com/document/d/1_iVf_CIu2RZd3yWWF6KoRNlBiz5NbSIK0yThqG0EvPY/edit?usp=sharing
>>>>
>>>>
>>>> On Thu, Oct 12, 2023 at 3:58 PM Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> Hi dev,
>>>>>
>>>>> I'd like to start a discussion on "State Data Source - Reader".
>>>>>
>>>>> This proposal aims to introduce a new data source, "statestore", which
>>>>> enables reading the state rows from an existing checkpoint via an
>>>>> offline (batch) query. This will enable users to 1) create unit tests
>>>>> against a stateful query verifying the state value (especially
>>>>> flatMapGroupsWithState), and 2) gather more context on the status when
>>>>> an incident occurs, especially for incorrect output.
>>>>>
>>>>> *SPIP*:
>>>>> https://docs.google.com/document/d/1HjEupRv8TRFeULtJuxRq_tEG1Wq-9UNu-ctGgCYRke0/edit?usp=sharing
>>>>> *JIRA*: https://issues.apache.org/jira/browse/SPARK-45511
>>>>>
>>>>> Looking forward to your feedback!
>>>>>
>>>>> Thanks,
>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>
>>>>> ps. The scope of the project is narrowed to the reader in this SPIP,
>>>>> since the writer requires us to consider more cases. We are planning on 
>>>>> it.
>>>>>
>>>>
>>
>> --
>> Bartosz Konieczny
>> freelance data engineer
>> https://www.waitingforcode.com
>> https://github.com/bartosz25/
>> https://twitter.com/waitingforcode
>>
>>


Re: [DISCUSS] SPIP: State Data Source - Reader

2023-10-18 Thread Jungtaek Lim
Thanks Yuanjian for your support!

I've left a comment, but to replicate here - I agree with your point. It's
really hard for a new feature to be stable from the initial version, and we
might need to decide on breaking backward compatibility for (semantic) bug
fixes/improvements. Maybe we could mark the data source as
incubating/experimental and wait for a couple of minor releases to see
whether the options/behaviors can be finalized.

On Wed, Oct 18, 2023 at 4:24 PM Yuanjian Li  wrote:

> +1, I have no issues with the practicality and value of this feature
> itself.
> I've left some comments concerning ongoing maintenance and
> compatibility-related matters, which we can continue to discuss.
>
> Jungtaek Lim  于2023年10月17日周二 05:23写道:
>
>> Thanks Bartosz and Anish for your support!
>>
>> I'll wait a couple more days to see whether we can hear more voices on
>> this. We could then look at initiating a VOTE thread if there is no
>> objection.
>>
>> On Tue, Oct 17, 2023 at 5:48 AM Anish Shrigondekar <
>> anish.shrigonde...@databricks.com> wrote:
>>
>>> Hi Jungtaek,
>>>
>>> Thanks for putting this together. +1 from me and looks good overall.
>>> Posted some minor comments/questions to the doc.
>>>
>>> Thanks,
>>> Anish
>>>
>>> On Mon, Oct 16, 2023 at 11:25 AM Bartosz Konieczny <
>>> bartkoniec...@gmail.com> wrote:
>>>
>>>> Thank you, Jungtaek, for your answers! It's clear now.
>>>>
>>>> +1 for me. It seems like a prerequisite for further ops-related
>>>> improvements for the state store management. I mean especially here the
>>>> state rebalancing that could rely on this read+write state store API. I
>>>> don't mean here the dynamic state rebalancing that could probably be
>>>> implemented with a lower latency directly in the stateful API. Instead I'm
>>>> thinking more of an offline job to rebalance the state and later restart
>>>> the stateful pipeline with the changed number of shuffle partitions.
>>>>
>>>> Best,
>>>> Bartosz.
>>>>
>>>> On Mon, Oct 16, 2023 at 6:19 PM Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> bump for better reach
>>>>>
>>>>> On Thu, Oct 12, 2023 at 4:26 PM Jungtaek Lim <
>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> Sorry, please use this link instead for SPIP doc:
>>>>>> https://docs.google.com/document/d/1_iVf_CIu2RZd3yWWF6KoRNlBiz5NbSIK0yThqG0EvPY/edit?usp=sharing
>>>>>>
>>>>>>
>>>>>> On Thu, Oct 12, 2023 at 3:58 PM Jungtaek Lim <
>>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi dev,
>>>>>>>
>>>>>>> I'd like to start a discussion on "State Data Source - Reader".
>>>>>>>
>>>>>>> This proposal aims to introduce a new data source "statestore" which
>>>>>>> enables reading the state rows from existing checkpoint via offline 
>>>>>>> (batch)
>>>>>>> query. This will enable users to 1) create unit tests against stateful
>>>>>>> query verifying the state value (especially flatMapGroupsWithState), 2)
>>>>>>> gather more context on the status when an incident occurs, especially 
>>>>>>> for
>>>>>>> incorrect output.
>>>>>>>
>>>>>>> *SPIP*:
>>>>>>> https://docs.google.com/document/d/1HjEupRv8TRFeULtJuxRq_tEG1Wq-9UNu-ctGgCYRke0/edit?usp=sharing
>>>>>>> *JIRA*: https://issues.apache.org/jira/browse/SPARK-45511
>>>>>>>
>>>>>>> Looking forward to your feedback!
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>
>>>>>>> ps. The scope of the project is narrowed to the reader in this SPIP,
>>>>>>> since the writer requires us to consider more cases. We are planning on 
>>>>>>> it.
>>>>>>>
>>>>>>
>>>>
>>>> --
>>>> Bartosz Konieczny
>>>> freelance data engineer
>>>> https://www.waitingforcode.com
>>>> https://github.com/bartosz25/
>>>> https://twitter.com/waitingforcode
>>>>
>>>>


Re: [DISCUSS] SPIP: State Data Source - Reader

2023-10-16 Thread Jungtaek Lim
Thanks Bartosz and Anish for your support!

I'll wait for a couple more days to see whether we can hear more voices on
this. We could probably look for initiating a VOTE thread if there is no
objection.

On Tue, Oct 17, 2023 at 5:48 AM Anish Shrigondekar <
anish.shrigonde...@databricks.com> wrote:

> Hi Jungtaek,
>
> Thanks for putting this together. +1 from me and looks good overall.
> Posted some minor comments/questions to the doc.
>
> Thanks,
> Anish
>
> On Mon, Oct 16, 2023 at 11:25 AM Bartosz Konieczny <
> bartkoniec...@gmail.com> wrote:
>
>> Thank you, Jungtaek, for your answers! It's clear now.
>>
>> +1 for me. It seems like a prerequisite for further ops-related
>> improvements for the state store management. I mean especially here the
>> state rebalancing that could rely on this read+write state store API. I
>> don't mean here the dynamic state rebalancing that could probably be
>> implemented with a lower latency directly in the stateful API. Instead I'm
>> thinking more of an offline job to rebalance the state and later restart
>> the stateful pipeline with the changed number of shuffle partitions.
>>
>> Best,
>> Bartosz.
>>
>> On Mon, Oct 16, 2023 at 6:19 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> bump for better reach
>>>
>>> On Thu, Oct 12, 2023 at 4:26 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> Sorry, please use this link instead for SPIP doc:
>>>> https://docs.google.com/document/d/1_iVf_CIu2RZd3yWWF6KoRNlBiz5NbSIK0yThqG0EvPY/edit?usp=sharing
>>>>
>>>>
>>>> On Thu, Oct 12, 2023 at 3:58 PM Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> Hi dev,
>>>>>
>>>>> I'd like to start a discussion on "State Data Source - Reader".
>>>>>
>>>>> This proposal aims to introduce a new data source "statestore" which
>>>>> enables reading the state rows from existing checkpoint via offline 
>>>>> (batch)
>>>>> query. This will enable users to 1) create unit tests against stateful
>>>>> query verifying the state value (especially flatMapGroupsWithState), 2)
>>>>> gather more context on the status when an incident occurs, especially for
>>>>> incorrect output.
>>>>>
>>>>> *SPIP*:
>>>>> https://docs.google.com/document/d/1HjEupRv8TRFeULtJuxRq_tEG1Wq-9UNu-ctGgCYRke0/edit?usp=sharing
>>>>> *JIRA*: https://issues.apache.org/jira/browse/SPARK-45511
>>>>>
>>>>> Looking forward to your feedback!
>>>>>
>>>>> Thanks,
>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>
>>>>> ps. The scope of the project is narrowed to the reader in this SPIP,
>>>>> since the writer requires us to consider more cases. We are planning on 
>>>>> it.
>>>>>
>>>>
>>
>> --
>> Bartosz Konieczny
>> freelance data engineer
>> https://www.waitingforcode.com
>> https://github.com/bartosz25/
>> https://twitter.com/waitingforcode
>>
>>


Re: [DISCUSS] SPIP: State Data Source - Reader

2023-10-16 Thread Jungtaek Lim
bump for better reach

On Thu, Oct 12, 2023 at 4:26 PM Jungtaek Lim 
wrote:

> Sorry, please use this link instead for SPIP doc:
> https://docs.google.com/document/d/1_iVf_CIu2RZd3yWWF6KoRNlBiz5NbSIK0yThqG0EvPY/edit?usp=sharing
>
>
> On Thu, Oct 12, 2023 at 3:58 PM Jungtaek Lim 
> wrote:
>
>> Hi dev,
>>
>> I'd like to start a discussion on "State Data Source - Reader".
>>
>> This proposal aims to introduce a new data source "statestore" which
>> enables reading the state rows from existing checkpoint via offline (batch)
>> query. This will enable users to 1) create unit tests against stateful
>> query verifying the state value (especially flatMapGroupsWithState), 2)
>> gather more context on the status when an incident occurs, especially for
>> incorrect output.
>>
>> *SPIP*:
>> https://docs.google.com/document/d/1HjEupRv8TRFeULtJuxRq_tEG1Wq-9UNu-ctGgCYRke0/edit?usp=sharing
>> *JIRA*: https://issues.apache.org/jira/browse/SPARK-45511
>>
>> Looking forward to your feedback!
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> ps. The scope of the project is narrowed to the reader in this SPIP,
>> since the writer requires us to consider more cases. We are planning on it.
>>
>


Re: [DISCUSS] SPIP: State Data Source - Reader

2023-10-12 Thread Jungtaek Lim
Sorry, please use this link instead for SPIP doc:
https://docs.google.com/document/d/1_iVf_CIu2RZd3yWWF6KoRNlBiz5NbSIK0yThqG0EvPY/edit?usp=sharing


On Thu, Oct 12, 2023 at 3:58 PM Jungtaek Lim 
wrote:

> Hi dev,
>
> I'd like to start a discussion on "State Data Source - Reader".
>
> This proposal aims to introduce a new data source "statestore" which
> enables reading the state rows from existing checkpoint via offline (batch)
> query. This will enable users to 1) create unit tests against stateful
> query verifying the state value (especially flatMapGroupsWithState), 2)
> gather more context on the status when an incident occurs, especially for
> incorrect output.
>
> *SPIP*:
> https://docs.google.com/document/d/1HjEupRv8TRFeULtJuxRq_tEG1Wq-9UNu-ctGgCYRke0/edit?usp=sharing
> *JIRA*: https://issues.apache.org/jira/browse/SPARK-45511
>
> Looking forward to your feedback!
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> ps. The scope of the project is narrowed to the reader in this SPIP, since
> the writer requires us to consider more cases. We are planning on it.
>


[DISCUSS] SPIP: State Data Source - Reader

2023-10-12 Thread Jungtaek Lim
Hi dev,

I'd like to start a discussion on "State Data Source - Reader".

This proposal aims to introduce a new data source "statestore" which
enables reading the state rows from existing checkpoint via offline (batch)
query. This will enable users to 1) create unit tests against stateful
query verifying the state value (especially flatMapGroupsWithState), 2)
gather more context on the status when an incident occurs, especially for
incorrect output.

*SPIP*:
https://docs.google.com/document/d/1HjEupRv8TRFeULtJuxRq_tEG1Wq-9UNu-ctGgCYRke0/edit?usp=sharing
*JIRA*: https://issues.apache.org/jira/browse/SPARK-45511

Looking forward to your feedback!

Thanks,
Jungtaek Lim (HeartSaVioR)

ps. The scope of the project is narrowed to the reader in this SPIP, since
the writer requires us to consider more cases. We are planning on it.
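For readers less familiar with the proposal, a usage sketch may help. The "statestore" format name comes from the SPIP; the option names and the key/value output schema below are illustrative assumptions, not a finalized API.

```scala
// Hypothetical sketch of reading state rows from an existing checkpoint.
// "statestore" is the format name from the SPIP; the option names and the
// key/value schema here are assumptions for illustration only.
import org.apache.spark.sql.SparkSession

object StateReaderSketch extends App {
  val spark = SparkSession.builder().appName("state-reader-sketch").getOrCreate()

  val stateDf = spark.read
    .format("statestore")
    .option("batchId", "42")    // assumed: which microbatch's state to read
    .option("operatorId", "0")  // assumed: which stateful operator in the query
    .load("/path/to/checkpoint")

  // e.g. verify aggregation state values in a unit test
  stateDf.selectExpr("key", "value").show(truncate = false)
}
```

This is the kind of offline (batch) query the SPIP enables against a streaming checkpoint.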


Re: Watermark on late data only

2023-10-10 Thread Jungtaek Lim
slight correction/clarification: We now take the "previous" watermark to
determine the late record, because they are valid inputs for non-first
stateful operators dropping records based on the same criteria would drop
valid records from previous (upstream) stateful operators. Please look back
which criteria we use for evicting states, which could become outputs of
the operator.

On Tue, Oct 10, 2023 at 8:10 PM Jungtaek Lim 
wrote:

> We wouldn't like to expose the internal mechanism to the public.
>
> As you are a very detail-oriented engineer tracking major changes, you
> might notice that we "changed" the definition of late record while fixing
> late records. Previously a late record was defined as a record whose
> event time is earlier than the "current" watermark. How has it
> changed? We now take the "previous" watermark to determine the late record,
> because they are valid inputs for non-first stateful operators. If we were
> exposing the function current_watermark() which provides current watermark
> and users somehow build a side-output based on this, it would be broken
> when we introduce the fix on late record filtering. Or even worse, we may
> decide not to fix the issue worrying too much about existing workloads, and
> give up multiple stateful operators.
>
> The change is arguably not a breaking change, because we never guarantee
> that we won't process the data which is earlier than the watermark. The
> guarantee is one way, we guarantee that the record is processed if the
> event time of the record is later than the watermark. The opposite way is
> not guaranteed, and we actually documented this in the guide doc.
>
> So the workaround I mentioned cannot be used for capturing dropped late
> records - that does not work as expected. We will need to apply exactly the
> same criteria (probably the same predicate) on capturing them. We are aware
> of the demand for side-output of dropped late records, and I also agree
> that just having numbers of dropped records is never ideal.
>
> Let's see whether we have an opportunity to prioritize this. If you have
> an idea (sketched design) for implementing this, that should be awesome!
>
> On Tue, Oct 10, 2023 at 6:27 PM Bartosz Konieczny 
> wrote:
>
>> Thank you for the clarification, Jungtaek. Indeed, it doesn't sound
>> like a highly demanded feature from the end users, haven't seen that a lot
>> on StackOverflow or mailing lists. I was just curious about the reasons.
>>
>> Using the arbitrary stateful processing could be indeed a workaround! But
>> IMHO it would be easier to expose this watermark value from a function like
>> a current_watermark() and let the users do anything with the data. And
>> it wouldn't require having the state store overhead to deal with. The
>> function could simplify implementing the *side output pattern* where we
>> could process the on-time data differently from the late data, e.g. write
>> late data to a dedicated space in the lake and facilitate the backfilling
>> for the batch pipelines?
>>
>> With the current_watermark function it could be expressed as a simple:
>>
>> streamDataset.foreachBatch((dataframe, batchVersion) =>  {
>>   dataframe.cache()
>>   dataframe.filter(current_watermark() >
>> event_time_from_dataframe).writeTo("late_data")
>>   dataframe.filter(current_watermark() <=
>> event_time_from_dataframe).writeTo("on_time_data")
>> })
>>
>> A little bit as you can do with Apache Flink in fact:
>>
>> https://github.com/immerok/recipes/blob/main/late-data-to-sink/src/main/java/com/immerok/cookbook/LateDataToSeparateSink.java#L81
>>
>> WDYT?
>>
>> Best,
>> Bartosz.
>>
>> PS. Will be happy to contribute on that if the feature does make sense ;)
>>
>> On Tue, Oct 10, 2023 at 3:23 AM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Technically speaking, "late data" represents the data which cannot be
>>> processed due to the fact the engine threw out the state associated with
>>> the data already.
>>>
>>> That said, the only reason watermark does exist for streaming is to
>>> handle stateful operators. From the engine's point of view, there is no
>>> concept about "late data" for stateless query. It's something users have to
>>> leverage "filter" by themselves, without relying on the value of watermark.
>>> I guess someone may see some benefit of automatic tracking of trend for
>>> event time and want to define late data based on the watermark even in
>>> stateless query,

Re: Watermark on late data only

2023-10-10 Thread Jungtaek Lim
We wouldn't like to expose the internal mechanism to the public.

As you are a very detail-oriented engineer tracking major changes, you
might notice that we "changed" the definition of late record while fixing
late records. Previously a late record was defined as a record whose
event time is earlier than the "current" watermark. How has it
changed? We now take the "previous" watermark to determine the late record,
because they are valid inputs for non-first stateful operators. If we were
exposing the function current_watermark() which provides current watermark
and users somehow build a side-output based on this, it would be broken
when we introduce the fix on late record filtering. Or even worse, we may
decide not to fix the issue worrying too much about existing workloads, and
give up multiple stateful operators.

The change is arguably not a breaking change, because we never guarantee
that we won't process the data which is earlier than the watermark. The
guarantee is one way, we guarantee that the record is processed if the
event time of the record is later than the watermark. The opposite way is
not guaranteed, and we actually documented this in the guide doc.

So the workaround I mentioned cannot be used for capturing dropped late
records - that does not work as expected. We will need to apply exactly the
same criteria (probably the same predicate) on capturing them. We are aware
of the demand for side-output of dropped late records, and I also agree
that just having numbers of dropped records is never ideal.

Let's see whether we have an opportunity to prioritize this. If you have an
idea (sketched design) for implementing this, that should be awesome!
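The point about applying "exactly the same criteria" can be made concrete with a small model. The following is a simplified illustration of the previous-vs-current watermark distinction described above, not Spark's actual implementation; any side-output capture of dropped records would need to share exactly the same predicate as the operator itself.

```scala
// Simplified model of late-record filtering. A record is "late" relative to
// whichever watermark the operator uses as its drop criterion. Judging the
// output of an upstream stateful operator against the CURRENT watermark would
// wrongly drop valid records; the PREVIOUS watermark keeps them.
object LateRecordModel extends App {
  case class Record(key: String, eventTimeMs: Long)

  def isLate(record: Record, watermarkMs: Long): Boolean =
    record.eventTimeMs < watermarkMs

  val previousWatermark = 1000L
  val currentWatermark  = 2000L

  // Emitted by an upstream stateful operator as the watermark crossed 2000:
  val upstreamOutput = Record("a", eventTimeMs = 1500L)

  assert(isLate(upstreamOutput, currentWatermark))   // would drop a valid record
  assert(!isLate(upstreamOutput, previousWatermark)) // kept, as intended
}
```

This is why a user-visible current_watermark() function would have broken side-outputs built on it when the filtering criterion changed.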

On Tue, Oct 10, 2023 at 6:27 PM Bartosz Konieczny 
wrote:

> Thank you for the clarification, Jungtaek. Indeed, it doesn't sound like
> a highly demanded feature from the end users, haven't seen that a lot on
> StackOverflow or mailing lists. I was just curious about the reasons.
>
> Using the arbitrary stateful processing could be indeed a workaround! But
> IMHO it would be easier to expose this watermark value from a function like
> a current_watermark() and let the users do anything with the data. And it
> wouldn't require having the state store overhead to deal with. The function
> could simplify implementing the *side output pattern* where we could
> process the on-time data differently from the late data, e.g. write late
> data to a dedicated space in the lake and facilitate the backfilling for
> the batch pipelines?
>
> With the current_watermark function it could be expressed as a simple:
>
> streamDataset.foreachBatch((dataframe, batchVersion) =>  {
>   dataframe.cache()
>   dataframe.filter(current_watermark() >
> event_time_from_dataframe).writeTo("late_data")
>   dataframe.filter(current_watermark() <=
> event_time_from_dataframe).writeTo("on_time_data")
> })
>
> A little bit as you can do with Apache Flink in fact:
>
> https://github.com/immerok/recipes/blob/main/late-data-to-sink/src/main/java/com/immerok/cookbook/LateDataToSeparateSink.java#L81
>
> WDYT?
>
> Best,
> Bartosz.
>
> PS. Will be happy to contribute on that if the feature does make sense ;)
>
> On Tue, Oct 10, 2023 at 3:23 AM Jungtaek Lim 
> wrote:
>
>> Technically speaking, "late data" represents the data which cannot be
>> processed due to the fact the engine threw out the state associated with
>> the data already.
>>
>> That said, the only reason watermark does exist for streaming is to
>> handle stateful operators. From the engine's point of view, there is no
>> concept about "late data" for stateless query. It's something users have to
>> leverage "filter" by themselves, without relying on the value of watermark.
>> I guess someone may see some benefit of automatic tracking of trend for
>> event time and want to define late data based on the watermark even in
>> stateless queries, but personally I haven't heard such a request so far.
>>
>> As a workaround you can leverage flatMapGroupsWithState which provides
>> the value of watermark for you, but I'd agree it's too heavyweight just to
>> do this. If we see consistent demand on it, we could probably look into it
>> and maybe introduce a new SQL function (which works only on streaming -
>> that's probably a major blocker on introduction) on it.
>>
>> On Mon, Oct 9, 2023 at 11:03 AM Bartosz Konieczny <
>> bartkoniec...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I've been analyzing the watermark propagation added in the 3.5.0
>>> recently and had to return to the basics of watermarks. One question is
>>> still unanswered in my head.

Re: Watermark on late data only

2023-10-09 Thread Jungtaek Lim
Technically speaking, "late data" represents the data which cannot be
processed due to the fact the engine threw out the state associated with
the data already.

That said, the only reason watermark does exist for streaming is to handle
stateful operators. From the engine's point of view, there is no concept
about "late data" for stateless query. It's something users have to
leverage "filter" by themselves, without relying on the value of watermark.
I guess someone may see some benefit of automatic tracking of trend for
event time and want to define late data based on the watermark even in
stateless queries, but personally I haven't heard such a request so far.

As a workaround you can leverage flatMapGroupsWithState which provides the
value of watermark for you, but I'd agree it's too heavyweight just to do
this. If we see consistent demand on it, we could probably look into it and
maybe introduce a new SQL function (which works only on streaming - that's
probably a major blocker on introduction) on it.
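As a rough illustration of the flatMapGroupsWithState workaround mentioned above: GroupState does expose the current watermark via getCurrentWatermarkMs(), which requires a watermark to be defined on the input. The surrounding types and wiring below are assumptions for illustration, not a recommended pattern.

```scala
// Sketch: tapping the event-time watermark via flatMapGroupsWithState.
// Only GroupState.getCurrentWatermarkMs() is the real API of interest here;
// Event, Tagged, and the tagging logic are illustrative assumptions.
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Event(key: String, eventTimeMs: Long)
case class Tagged(key: String, eventTimeMs: Long, late: Boolean)

def tagLate(key: String, events: Iterator[Event],
            state: GroupState[Long]): Iterator[Tagged] = {
  val watermark = state.getCurrentWatermarkMs()
  events.map(e => Tagged(e.key, e.eventTimeMs, e.eventTimeMs < watermark))
}

// eventsDs is assumed to be a Dataset[Event] with withWatermark already applied:
// eventsDs.groupByKey(_.key).flatMapGroupsWithState(
//   OutputMode.Append, GroupStateTimeout.NoTimeout)(tagLate)
```

The state-store overhead this pulls in is exactly why it is described as too heavyweight just for watermark access.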

On Mon, Oct 9, 2023 at 11:03 AM Bartosz Konieczny 
wrote:

> Hi,
>
> I've been analyzing the watermark propagation added in the 3.5.0 recently
> and had to return to the basics of watermarks. One question is still
> unanswered in my head.
>
> Why are the watermarks reserved to stateful queries? Can't they apply to
> the filtering late date out only?
>
> The reason is only historical, as the initial design doc
> mentions the aggregated queries exclusively? Or are there any technical
> limitations why writing the jobs like below don't drop late data
> automatically?
>
> import sparkSession.implicits._
> implicit val sparkContext = sparkSession.sqlContext
> val clicksStream = MemoryStream[Click]
> val clicksWithWatermark = clicksStream.toDF
>   .withWatermark("clickTime", "10 minutes")
> val query =
> clicksWithWatermark.writeStream.format("console").option("truncate", false)
>   .start()
>
> clicksStream.addData(Seq(
>   Click(1, Timestamp.valueOf("2023-06-10 10:10:00")),
>   Click(2, Timestamp.valueOf("2023-06-10 10:12:00")),
>   Click(3, Timestamp.valueOf("2023-06-10 10:14:00"))
> ))
>
>
> query.processAllAvailable()
>
> clicksStream.addData(Seq(
>   Click(4, Timestamp.valueOf("2023-06-10 11:00:40")),
>   Click(5, Timestamp.valueOf("2023-06-10 11:00:30")),
>   Click(6, Timestamp.valueOf("2023-06-10 11:00:10")),
>   Click(10, Timestamp.valueOf("2023-06-10 10:00:10"))
> ))
> query.processAllAvailable()
>
> One quick implementation could be adding a new physical plan rule to the
> IncrementalExecution
> for the EventTimeWatermark node. That's a first thought, maybe too
> simplistic and hiding some pitfalls?
>
> Best,
> Bartosz.
> --
> freelance data engineer
> https://www.waitingforcode.com
> https://github.com/bartosz25/
> https://twitter.com/waitingforcode
>
>


Re: Welcome to Our New Apache Spark Committer and PMCs

2023-10-04 Thread Jungtaek Lim
Congrats!

On Wed, Oct 4, 2023 at 5:04 PM, yangjie01  wrote:

> Congratulations!
>
>
>
> Jie Yang
>
>
>
> *From:* Dongjoon Hyun 
> *Date:* Wednesday, October 4, 2023, 13:04
> *To:* Hyukjin Kwon 
> *Cc:* Hussein Awala , Rui Wang ,
> Gengliang Wang , Xiao Li , "
> dev@spark.apache.org" 
> *Subject:* Re: Welcome to Our New Apache Spark Committer and PMCs
>
>
>
> Congratulations!
>
>
>
> Dongjoon.
>
>
>
> On Tue, Oct 3, 2023 at 5:25 PM Hyukjin Kwon  wrote:
>
> Woohoo!
>
>
>
> On Tue, 3 Oct 2023 at 22:47, Hussein Awala  wrote:
>
> Congrats to all of you!
>
>
>
> On Tue 3 Oct 2023 at 08:15, Rui Wang  wrote:
>
> Congratulations! Well deserved!
>
>
>
> -Rui
>
>
>
>
>
> On Mon, Oct 2, 2023 at 10:32 PM Gengliang Wang  wrote:
>
> Congratulations to all! Well deserved!
>
>
>
> On Mon, Oct 2, 2023 at 10:16 PM Xiao Li  wrote:
>
> Hi all,
>
> The Spark PMC is delighted to announce that we have voted to add one new
> committer and two new PMC members. These individuals have consistently
> contributed to the project and have clearly demonstrated their expertise.
>
> New Committer:
> - Jiaan Geng (focusing on Spark Connect and Spark SQL)
>
> New PMCs:
> - Yuanjian Li
> - Yikun Jiang
>
> Please join us in extending a warm welcome to them in their new roles!
>
> Sincerely,
> The Spark PMC
>
>


[DISCUSS] Porting back SPARK-45178 to 3.5/3.4 version lines

2023-09-20 Thread Jungtaek Lim
Hi devs,

I'd like to get some inputs for dealing with the possible correctness issue
we figured. The JIRA ticket is SPARK-45178
<https://issues.apache.org/jira/browse/SPARK-45178> and I described the
issue and solution I proposed.

Context:
Source might behave incorrectly leading to correctness issues if it does
not support Trigger.AvailableNow and users set the trigger to
Trigger.AvailableNow. This is due to the incompatibility between fallback
implementation of Trigger.AvailableNow and the source implementation. As a
solution, we want to fall back to single back execution instead for such
cases.

The proposal is approved and merged in master branch (I guess there is no
issue as it's a major release), but since this introduces a behavioral
change, I'd like to hear voices on whether we want to introduce a
behavioral change in bugfix versions to address possible correctness, or
leave these version lines as they are.

Looking for voices on this.

Thanks in advance!
Jungtaek Lim (HeartSaVioR)


Re: [VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-11 Thread Jungtaek Lim
+1 (non-binding)

Thanks for driving this release and the patience on multiple RCs!

On Tue, Sep 12, 2023 at 10:00 AM Yuanjian Li  wrote:

> +1 (non-binding)
>
> On Mon, Sep 11, 2023 at 09:36, Yuanjian Li  wrote:
>
>> @Peter Toth  I've looked into the details of this
>> issue, and it appears that it's neither a regression in version 3.5.0 nor a
>> correctness issue. It's a bug related to a new feature. I think we can fix
>> this in 3.5.1 and list it as a known issue of the Scala client of Spark
>> Connect in 3.5.0.
>>
>> On Sun, Sep 10, 2023 at 04:12, Mridul Muralidharan  wrote:
>>
>>>
>>> +1
>>>
>>> Signatures, digests, etc check out fine.
>>> Checked out tag and build/tested with -Phive -Pyarn -Pmesos -Pkubernetes
>>>
>>> Regards,
>>> Mridul
>>>
>>> On Sat, Sep 9, 2023 at 10:02 AM Yuanjian Li 
>>> wrote:
>>>
 Please vote on releasing the following candidate(RC5) as Apache Spark
 version 3.5.0.

 The vote is open until 11:59pm Pacific time Sep 11th and passes if a
 majority +1 PMC votes are cast, with a minimum of 3 +1 votes.

 [ ] +1 Release this package as Apache Spark 3.5.0

 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see http://spark.apache.org/

 The tag to be voted on is v3.5.0-rc5 (commit
 ce5ddad990373636e94071e7cef2f31021add07b):

 https://github.com/apache/spark/tree/v3.5.0-rc5

 The release files, including signatures, digests, etc. can be found at:

 https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc5-bin/

 Signatures used for Spark RCs can be found in this file:

 https://dist.apache.org/repos/dist/dev/spark/KEYS

 The staging repository for this release can be found at:

 https://repository.apache.org/content/repositories/orgapachespark-1449

 The documentation corresponding to this release can be found at:

 https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc5-docs/

 The list of bug fixes going into 3.5.0 can be found at the following
 URL:

 https://issues.apache.org/jira/projects/SPARK/versions/12352848

 This release is using the release script of the tag v3.5.0-rc5.


 FAQ

 =

 How can I help test this release?

 =

 If you are a Spark user, you can help us test this release by taking

 an existing Spark workload and running on this release candidate, then

 reporting any regressions.

 If you're working in PySpark you can set up a virtual env and install

 the current RC and see if anything important breaks, in the Java/Scala

 you can add the staging repository to your projects resolvers and test

 with the RC (make sure to clean up the artifact cache before/after so

 you don't end up building with an out of date RC going forward).

 ===

 What should happen to JIRA tickets still targeting 3.5.0?

 ===

 The current list of open tickets targeted at 3.5.0 can be found at:

 https://issues.apache.org/jira/projects/SPARK and search for "Target
 Version/s" = 3.5.0

 Committers should look at those and triage. Extremely important bug

 fixes, documentation, and API tweaks that impact compatibility should

 be worked on immediately. Everything else please retarget to an

 appropriate release.

 ==

 But my bug isn't fixed?

 ==

 In order to make timely releases, we will typically not hold the

 release unless the bug in question is a regression from the previous

 release. That being said, if there is something which is a regression

 that has not been correctly targeted please ping me or a committer to

 help target the issue.

 Thanks,

 Yuanjian Li

>>>


Re: [VOTE] Release Apache Spark 3.5.0 (RC4)

2023-09-08 Thread Jungtaek Lim
+1 (non-binding)

Thanks for driving this release!

On Fri, Sep 8, 2023 at 11:29 AM Holden Karau  wrote:

> +1 pip installing seems to function :)
>
> On Thu, Sep 7, 2023 at 7:22 PM Yuming Wang  wrote:
>
>> +1.
>>
>> On Thu, Sep 7, 2023 at 10:33 PM yangjie01 
>> wrote:
>>
>>> +1
>>>
>>>
>>>
>>> *From:* Gengliang Wang 
>>> *Date:* Thursday, September 7, 2023, 12:53
>>> *To:* Yuanjian Li 
>>> *Cc:* Xiao Li , "her...@databricks.com.invalid"
>>> , Spark dev list 
>>> *Subject:* Re: [VOTE] Release Apache Spark 3.5.0 (RC4)
>>>
>>>
>>>
>>> +1
>>>
>>>
>>>
>>> On Wed, Sep 6, 2023 at 9:46 PM Yuanjian Li 
>>> wrote:
>>>
>>> +1 (non-binding)
>>>
>>> On Wed, Sep 6, 2023 at 15:27, Xiao Li  wrote:
>>>
>>> +1
>>>
>>>
>>>
>>> Xiao
>>>
>>>
>>>
>>> On Wed, Sep 6, 2023 at 22:08, Herman van Hovell  wrote:
>>>
>>> Tested connect, and everything looks good.
>>>
>>>
>>>
>>> +1
>>>
>>>
>>>
>>> On Wed, Sep 6, 2023 at 8:11 AM Yuanjian Li 
>>> wrote:
>>>
>>> Please vote on releasing the following candidate(RC4) as Apache Spark
>>> version 3.5.0.
>>>
>>>
>>>
>>> The vote is open until 11:59pm Pacific time *Sep 8th* and passes if a
>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>>
>>>
>>> [ ] +1 Release this package as Apache Spark 3.5.0
>>>
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>>
>>>
>>> The tag to be voted on is v3.5.0-rc4 (commit
>>> c2939589a29dd0d6a2d3d31a8d833877a37ee02a):
>>>
>>> https://github.com/apache/spark/tree/v3.5.0-rc4
>>>
>>>
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>>
>>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc4-bin/
>>>
>>>
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>>
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>>
>>>
>>> The staging repository for this release can be found at:
>>>
>>> https://repository.apache.org/content/repositories/orgapachespark-1448
>>>
>>>
>>>
>>> The documentation corresponding to this release can be found at:
>>>
>>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc4-docs/
>>>
>>>
>>>
>>> The list of bug fixes going into 3.5.0 can be found at the following URL:
>>>
>>> https://issues.apache.org/jira/projects/SPARK/versions/12352848
>>>
>>>
>>>
>>> This release is using the release script of the tag v3.5.0-rc4.
>>>
>>>
>>>
>>> FAQ
>>>
>>>
>>>
>>> =
>>>
>>> How can I help test this release?
>>>
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>>
>>> an existing Spark workload and running on this release candidate, then
>>>
>>> reporting any regressions.
>>>
>>>
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>>
>>> the current RC and see if anything important breaks, in the Java/Scala
>>>
>>> you can add the staging repository to your projects resolvers and test
>>>
>>> with the RC (make sure to clean up the artifact cache before/after so
>>>
>>> you don't end up building with an out of date RC going forward).
>>>
>>>
>>>
>>> ===
>>>
>>> What should happen to JIRA tickets still targeting 3.5.0?
>>>
>>> ===
>>>
>>> The current list of open tickets targeted at 3.5.0 can be found at:
>>>
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 3.5.0
>>>
>>>
>>>
>>> Committers should look at those and triage. Extremely important bug
>>>
>>> fixes, documentation, and API tweaks that impact compatibility should
>>>
>>> be worked on immediately. Everything else please retarget to an
>>>
>>> appropriate release.
>>>
>>>
>>>
>>> ==
>>>
>>> But my bug isn't fixed?
>>>
>>> ==
>>>
>>> In order to make timely releases, we will typically not hold the
>>>
>>> release unless the bug in question is a regression from the previous
>>>
>>> release. That being said, if there is something which is a regression
>>>
>>> that has not been correctly targeted please ping me or a committer to
>>>
>>> help target the issue.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Yuanjian Li
>>>
>>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-09-01 Thread Jungtaek Lim
My apologies, I have to add another ticket for a blocker, SPARK-45045
<https://issues.apache.org/jira/browse/SPARK-45045>. That said, I'm -1
(non-binding).

SPARK-43183 <https://issues.apache.org/jira/browse/SPARK-43183> made a
behavioral change regarding the StreamingQueryListener as well as
StreamingQuery API as a side-effect, while the intention was more about
introducing the change in the former one. I just got some reports that the
behavioral change for StreamingQuery API broke various tests in 3rd party
data sources. To help 3rd party ecosystems to adopt 3.5 without hassle, I'd
like to see this be fixed in 3.5.0.

There is no fix yet but I'm working on it. I'll give an update here. Maybe
we could lower down priority and let the release go with describing this
as a "known issue", if I couldn't make progress in a couple of days. I'm
sorry about that.

Thanks,
Jungtaek Lim

On Fri, Sep 1, 2023 at 12:12 PM Wenchen Fan  wrote:

> Sorry for the last-minute bug report, but we found a regression in 3.5:
> the SQL INSERT command without a column list fills missing columns with
> NULL while Spark 3.4 does not allow it. According to the SQL standard, this
> shouldn't be allowed and thus a regression in 3.5.
>
> The fix has been merged but one day after the RC3 cut:
> https://github.com/apache/spark/pull/42393 . I'm -1 and let's include
> this fix in 3.5.
>
> Thanks,
> Wenchen
>
> On Thu, Aug 31, 2023 at 9:09 PM Ian Manning 
> wrote:
>
>> +1 (non-binding)
>>
>> Using Spark Core, Spark SQL, Structured Streaming.
>>
>> On Tue, Aug 29, 2023 at 8:12 PM Yuanjian Li 
>> wrote:
>>
>>> Please vote on releasing the following candidate(RC3) as Apache Spark
>>> version 3.5.0.
>>>
>>> The vote is open until 11:59pm Pacific time Aug 31st and passes if a
>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.5.0
>>>
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.5.0-rc3 (commit
>>> 9f137aa4dc43398aafa0c3e035ed3174182d7d6c):
>>>
>>> https://github.com/apache/spark/tree/v3.5.0-rc3
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>>
>>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc3-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>>
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>>
>>> https://repository.apache.org/content/repositories/orgapachespark-1447
>>>
>>> The documentation corresponding to this release can be found at:
>>>
>>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc3-docs/
>>>
>>> The list of bug fixes going into 3.5.0 can be found at the following URL:
>>>
>>> https://issues.apache.org/jira/projects/SPARK/versions/12352848
>>>
>>> This release is using the release script of the tag v3.5.0-rc3.
>>>
>>>
>>> FAQ
>>>
>>> =========================
>>>
>>> How can I help test this release?
>>>
>>> =========================
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>>
>>> an existing Spark workload and running on this release candidate, then
>>>
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>>
>>> the current RC and see if anything important breaks, in the Java/Scala
>>>
>>> you can add the staging repository to your projects resolvers and test
>>>
>>> with the RC (make sure to clean up the artifact cache before/after so
>>>
>>> you don't end up building with an out of date RC going forward).
>>>
>>> =========================
>>>
>>> What should happen to JIRA tickets still targeting 3.5.0?
>>>
>>> =========================
>>>
>>> The current list of open tickets targeted at 3.5.0 can be found at:
>>>
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 3.5.0
>>>
>>> Committers should look at those and triage. Extremely important bug
>>>
>>> fixes, documentation, and API tweaks that impact compatibility should
>>>
>>> be worked on immediately. Everything else please retarget to an
>>>
>>> appropriate release.
>>>
>>> =========================
>>>
>>> But my bug isn't fixed?
>>>
>>> =========================
>>>
>>> In order to make timely releases, we will typically not hold the
>>>
>>> release unless the bug in question is a regression from the previous
>>>
>>> release. That being said, if there is something which is a regression
>>>
>>> that has not been correctly targeted please ping me or a committer to
>>>
>>> help target the issue.
>>>
>>> Thanks,
>>>
>>> Yuanjian Li
>>>
>>


Re: Welcome two new Apache Spark committers

2023-08-06 Thread Jungtaek Lim
Congrats Peter and Xiduo!

On Mon, Aug 7, 2023 at 11:33 AM yangjie01 
wrote:

> Congratulations, Peter and Xiduo ~
>
>
>
> *From:* Hyukjin Kwon 
> *Date:* Monday, August 7, 2023 10:30
> *To:* Ruifeng Zheng 
> *Cc:* Xiao Li , Debasish Das <
> debasish.da...@gmail.com>, Wenchen Fan , Spark dev
> list 
> *Subject:* Re: Welcome two new Apache Spark committers
>
>
>
> Woohoo!
>
>
>
> On Mon, 7 Aug 2023 at 11:28, Ruifeng Zheng  wrote:
>
> Congratulations! Peter and Xiduo!
>
>
>
> On Mon, Aug 7, 2023 at 10:13 AM Xiao Li  wrote:
>
> Congratulations, Peter and Xiduo!
>
>
>
>
>
>
>
> Debasish Das  wrote on Sun, Aug 6, 2023 at 19:08:
>
> Congratulations Peter and Xiduo.
>
> On Sun, Aug 6, 2023, 7:05 PM Wenchen Fan  wrote:
>
> Hi all,
>
>
>
> The Spark PMC recently voted to add two new committers. Please join me in
> welcoming them to their new role!
>
>
>
> - Peter Toth (Spark SQL)
>
> - Xiduo You (Spark SQL)
>
>
>
> They consistently make contributions to the project and clearly showed
> their expertise. We are very excited to have them join as committers.
>
>


Re: [VOTE][SPIP] Python Data Source API

2023-07-10 Thread Jungtaek Lim
Just to be fully clear: the SPIP does not cover streaming, but if the
performance is not great compared to the JVM-based implementation in any
way (which I expect), I don't think it's good to integrate it with
streaming, which targets lower latency. That's the reason I gave +1 although
it doesn't cover streaming.

On Tue, Jul 11, 2023 at 8:35 AM Matei Zaharia 
wrote:

> +1
>
> On Jul 10, 2023, at 10:19 AM, Takuya UESHIN 
> wrote:
>
> +1
>
> On Sun, Jul 9, 2023 at 10:05 PM Ruifeng Zheng  wrote:
>
>> +1
>>
>> On Mon, Jul 10, 2023 at 8:20 AM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> +1
>>>
>>> On Sat, Jul 8, 2023 at 4:13 AM Reynold Xin 
>>> wrote:
>>>
>>>> +1!
>>>>
>>>>
>>>> On Fri, Jul 7 2023 at 11:58 AM, Holden Karau 
>>>> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Fri, Jul 7, 2023 at 9:55 AM huaxin gao 
>>>>> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> On Fri, Jul 7, 2023 at 8:59 AM Mich Talebzadeh <
>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> +1 for me
>>>>>>>
>>>>>>> Mich Talebzadeh,
>>>>>>> Solutions Architect/Engineering Lead
>>>>>>> Palantir Technologies Limited
>>>>>>> London
>>>>>>> United Kingdom
>>>>>>>
>>>>>>>view my Linkedin profile
>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>
>>>>>>>
>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>
>>>>>>>
>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>> for any loss, damage or destruction of data or any other property which 
>>>>>>> may
>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>> disclaimed. The author will in no case be liable for any monetary 
>>>>>>> damages
>>>>>>> arising from such loss, damage or destruction.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, 7 Jul 2023 at 11:05, Martin Grund
>>>>>>>  wrote:
>>>>>>>
>>>>>>>> +1 (non-binding)
>>>>>>>>
>>>>>>>> On Fri, Jul 7, 2023 at 12:05 AM Denny Lee 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> +1 (non-binding)
>>>>>>>>>
>>>>>>>>> On Fri, Jul 7, 2023 at 00:50 Maciej 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> +0
>>>>>>>>>>
>>>>>>>>>> Best regards,
>>>>>>>>>> Maciej Szymkiewicz
>>>>>>>>>>
>>>>>>>>>> Web: https://zero323.net
>>>>>>>>>> PGP: A30CEF0C31A501EC
>>>>>>>>>>
>>>>>>>>>> On 7/6/23 17:41, Xiao Li wrote:
>>>>>>>>>>
>>>>>>>>>> +1
>>>>>>>>>>
>>>>>>>>>> Xiao
>>>>>>>>>>
>>>>>>>>>> Hyukjin Kwon  wrote on Wed, Jul 5, 2023 at 17:28:
>>>>>>>>>>
>>>>>>>>>>> +1.
>>>>>>>>>>>
>>>>>>>>>>> See https://youtu.be/yj7XlTB1Jvc?t=604 :-).
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 6 Jul 2023 at 09:15, Allison Wang
>>>>>>>>>>> 
>>>>>>>>>>>  wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> I'd like to start the vote for SPIP: Python Data Source API.
>>>>>>>>>>>>
>>>>>>>>>>>> The high-level summary for the SPIP is that it aims to
>>>>>>>>>>>> introduce a simple API in Python for Data Sources. The idea is to 
>>>>>>>>>>>> enable
>>>>>>>>>>>> Python developers to create data sources without learning Scala or 
>>>>>>>>>>>> dealing
>>>>>>>>>>>> with the complexities of the current data source APIs. This would 
>>>>>>>>>>>> make
>>>>>>>>>>>> Spark more accessible to the wider Python developer community.
>>>>>>>>>>>>
>>>>>>>>>>>> References:
>>>>>>>>>>>>
>>>>>>>>>>>>- SPIP doc
>>>>>>>>>>>>
>>>>>>>>>>>> <https://docs.google.com/document/d/1oYrCKEKHzznljYfJO4kx5K_Npcgt1Slyfph3NEk7JRU/edit?usp=sharing>
>>>>>>>>>>>>- JIRA ticket
>>>>>>>>>>>><https://issues.apache.org/jira/browse/SPARK-44076>
>>>>>>>>>>>>- Discussion thread
>>>>>>>>>>>>
>>>>>>>>>>>> <https://lists.apache.org/thread/w621zn14ho4rw61b0s139klnqh900s8y>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Please vote on the SPIP for the next 72 hours:
>>>>>>>>>>>>
>>>>>>>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>>>>>>> [ ] +0
>>>>>>>>>>>> [ ] -1: I don’t think this is a good idea because __.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Allison
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>> https://amzn.to/2MaRAG9
>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>
>>>>
>
> --
> Takuya UESHIN
>
>
>


Re: [VOTE][SPIP] Python Data Source API

2023-07-09 Thread Jungtaek Lim
+1

On Sat, Jul 8, 2023 at 4:13 AM Reynold Xin 
wrote:

> +1!
>
>
> On Fri, Jul 7 2023 at 11:58 AM, Holden Karau 
> wrote:
>
>> +1
>>
>> On Fri, Jul 7, 2023 at 9:55 AM huaxin gao  wrote:
>>
>>> +1
>>>
>>> On Fri, Jul 7, 2023 at 8:59 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 +1 for me

 Mich Talebzadeh,
 Solutions Architect/Engineering Lead
 Palantir Technologies Limited
 London
 United Kingdom


view my Linkedin profile
 


  https://en.everybodywiki.com/Mich_Talebzadeh



 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.




 On Fri, 7 Jul 2023 at 11:05, Martin Grund 
 wrote:

> +1 (non-binding)
>
> On Fri, Jul 7, 2023 at 12:05 AM Denny Lee 
> wrote:
>
>> +1 (non-binding)
>>
>> On Fri, Jul 7, 2023 at 00:50 Maciej  wrote:
>>
>>> +0
>>>
>>> Best regards,
>>> Maciej Szymkiewicz
>>>
>>> Web: https://zero323.net
>>> PGP: A30CEF0C31A501EC
>>>
>>> On 7/6/23 17:41, Xiao Li wrote:
>>>
>>> +1
>>>
>>> Xiao
>>>
>>> Hyukjin Kwon  wrote on Wed, Jul 5, 2023 at 17:28:
>>>
 +1.

 See https://youtu.be/yj7XlTB1Jvc?t=604 :-).

 On Thu, 6 Jul 2023 at 09:15, Allison Wang
 
  wrote:

> Hi all,
>
> I'd like to start the vote for SPIP: Python Data Source API.
>
> The high-level summary for the SPIP is that it aims to introduce
> a simple API in Python for Data Sources. The idea is to enable Python
> developers to create data sources without learning Scala or dealing 
> with
> the complexities of the current data source APIs. This would make 
> Spark
> more accessible to the wider Python developer community.
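To make the idea concrete, a hypothetical sketch of what such a Python data-source interface could look like (illustrative only — the actual interface is defined in the SPIP doc, and the class and method names here are assumptions):

```python
from typing import Iterator, Tuple


class SimpleDataSource:
    # Hypothetical base class in the spirit of the SPIP: a source is just
    # a schema plus an iterator of rows, with no Scala involved.
    def schema(self) -> str:
        raise NotImplementedError

    def read(self) -> Iterator[Tuple]:
        raise NotImplementedError


class RangeSource(SimpleDataSource):
    # Toy source yielding (id, square) rows, implemented purely in Python.
    def __init__(self, n: int):
        self.n = n

    def schema(self) -> str:
        return "id INT, square INT"

    def read(self) -> Iterator[Tuple]:
        for i in range(self.n):
            yield (i, i * i)


print(list(RangeSource(3).read()))  # [(0, 0), (1, 1), (2, 4)]
```

The point of the SPIP is that implementing something of this shape is all a Python developer would need to do; Spark would handle planning and execution.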
>
> References:
>
>- SPIP doc
>
> 
>- JIRA ticket
>
>- Discussion thread
>
>
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because __.
>
> Thanks,
> Allison
>

>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>


Re: Time for Spark v3.5.0 release

2023-07-04 Thread Jungtaek Lim
+1

On Wed, Jul 5, 2023 at 2:23 AM L. C. Hsieh  wrote:

> +1
>
> Thanks Yuanjian.
>
> On Tue, Jul 4, 2023 at 7:45 AM yangjie01  wrote:
> >
> > +1
> >
> >
> >
> > From: Maxim Gekk 
> > Date: Tuesday, July 4, 2023 17:24
> > To: Kent Yao 
> > Cc: "dev@spark.apache.org" 
> > Subject: Re: Time for Spark v3.5.0 release
> >
> >
> >
> > +1
> >
> > On Tue, Jul 4, 2023 at 11:55 AM Kent Yao  wrote:
> >
> > +1, thank you
> >
> > Kent
> >
> > On 2023/07/04 05:32:52 Dongjoon Hyun wrote:
> > > +1
> > >
> > > Thank you, Yuanjian
> > >
> > > Dongjoon
> > >
> > > On Tue, Jul 4, 2023 at 1:03 AM Hyukjin Kwon 
> wrote:
> > >
> > > > Yeah one day postponed shouldn't be a big deal.
> > > >
> > > > On Tue, Jul 4, 2023 at 7:10 AM Yuanjian Li 
> wrote:
> > > >
> > > >> Hi All,
> > > >>
> > > >> According to the Spark versioning policy at
> > > >> https://spark.apache.org/versioning-policy.html, should we cut
> > > >> *branch-3.5* on *July 17th, 2023*? (We initially proposed July 16th,
> > > >> but since it's a Sunday, I suggest we postpone it by one day).
> > > >>
> > > >> I would like to volunteer as the release manager for Apache Spark
> 3.5.0.
> > > >>
> > > >> Best,
> > > >> Yuanjian
> > > >>
> > > >
> > >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Release Plan for Apache Spark 4.0.0 (June 2024)

2023-06-12 Thread Jungtaek Lim
I concur with Holden and Mridul. Let's build a plan before we call the
tentative deadline. I understand setting the tentative deadline would
definitely help in pushing back features which "never ever ends", but at
least we may want to list up features and discuss for priority. It is still
possible that we might even want to see some features as hard blocker on
the release for any reason, based on discussion of course.

On Tue, Jun 13, 2023 at 10:58 AM Mridul Muralidharan 
wrote:

>
> I agree with Holden, we should have some understanding of what we are
> targeting for 4.0, given it is a major ver bump - and work from there on
> the release date.
>
> Regards,
> Mridul
>
> On Mon, Jun 12, 2023 at 8:53 PM Jia Fan  wrote:
>
>> By the way, like Holden said, what are the big features for 4.0.0? I think a
>> very big version change always brings something different.
>>
>> Jia Fan  wrote on Tue, Jun 13, 2023 at 08:25:
>>
>>> +1
>>>
>>> 
>>>
>>> Jia Fan
>>>
>>>
>>>
>>> On Jun 13, 2023, at 03:51, Chao Sun  wrote:
>>>
>>> +1
>>>
>>> On Mon, Jun 12, 2023 at 12:50 PM kazuyuki tanimura
>>>  wrote:
>>>
 +1 (non-binding)

 Thank you!
 Kazu


 On Jun 12, 2023, at 11:32 AM, Holden Karau 
 wrote:

 -0

 I'd like to see more of a doc around what we're planning on for a 4.0
 before we pick a target release date etc. (feels like cart before the
 horse).

 But it's a weak preference.

 On Mon, Jun 12, 2023 at 11:24 AM Xiao Li  wrote:

> Thanks for starting the vote.
>
> I do have a concern about the target release date of Spark 4.0.
>
> L. C. Hsieh  wrote on Mon, Jun 12, 2023 at 11:09:
>
>> +1
>>
>> On Mon, Jun 12, 2023 at 11:06 AM huaxin gao 
>> wrote:
>> >
>> > +1
>> >
>> > On Mon, Jun 12, 2023 at 11:05 AM Dongjoon Hyun 
>> wrote:
>> >>
>> >> +1
>> >>
>> >> Dongjoon
>> >>
>> >> On 2023/06/12 18:00:38 Dongjoon Hyun wrote:
>> >> > Please vote on the release plan for Apache Spark 4.0.0.
>> >> >
>> >> > The vote is open until June 16th 1AM (PST) and passes if a
>> majority +1 PMC
>> >> > votes are cast, with a minimum of 3 +1 votes.
>> >> >
>> >> > [ ] +1 Have a release plan for Apache Spark 4.0.0 (June 2024)
>> >> > [ ] -1 Do not have a plan for Apache Spark 4.0.0 because ...
>> >> >
>> >> > ===
>> >> > Apache Spark 4.0.0 Release Plan
>> >> > ===
>> >> >
>> >> > 1. After creating `branch-3.5`, set "4.0.0-SNAPSHOT" in master
>> branch.
>> >> >
>> >> > 2. Creating `branch-4.0` on April 1st, 2024.
>> >> >
>> >> > 3. Apache Spark 4.0.0 RC1 on May 1st, 2024.
>> >> >
>> >> > 4. Apache Spark 4.0.0 Release in June, 2024.
>> >> >
>> >>
>> >>
>> -
>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>

 --
 Twitter: https://twitter.com/holdenkarau
 Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9  
 YouTube Live Streams: https://www.youtube.com/user/holdenkarau



>>>


Re: ASF policy violation and Scala version issues

2023-06-11 Thread Jungtaek Lim
Are we concerned that a library has not released a new version bumping the
Scala version, when that Scala version was announced less than a week ago?
Shall we respect the efforts of all maintainers of the open source projects we
use as dependencies, regardless of whether they are ASF projects or
individuals? Individual projects consist of volunteers (unlike projects
which are backed by small and big companies). Please remember they have
their daily jobs apart from these projects.

Also, if you look at the thread for 2.13.11,
they found two regressions in only 3 days, even before they announced the
version. Bumping a bugfix version is not always safe, especially for Scala,
where they use semver shifted one level down - their minor version is almost
another project's major version (a similar amount of pain on upgrading).

Btw, I see this is an effort toward supporting JDK 21, but GA of JDK 21 is
planned for September 19, according to a post on InfoQ. Do we
need to be coupled with a Java version which is not even released yet?
Shall we postpone this to Spark 4.0, and say supporting JDK 21 is a
stretch goal for Spark 3.5 rather than a blocker?
This is not a complete view, but one post about JDK usage among LTS versions
shows that JDK 17 is still at less than 10% although it was released 1.5 years
ago, and last year it was at less than 0.5%. In the real world, Java 11 is
still the majority, and 17 is slowly catching up. Even once
JDK 21 is released, we will have more than one year to
support it.



On Mon, Jun 12, 2023 at 4:54 AM Dongjoon Hyun 
wrote:

> Yes, that's exactly the pain point. I totally agree with you.
> For now, we are focusing on other stuffs more, but we need to resolve this
> situation soon.
>
> Dongjoon.
>
>
> On Sun, Jun 11, 2023 at 1:21 AM yangjie01  wrote:
>
>> Perhaps we should reconsider our reliance on and use of Ammonite? There
>> are still no new available versions of Ammonite one week after the release
>> of Scala 2.12.18 and 2.13.11. The question related to version release in
>> the Ammonite community also did not receive a response, which makes me feel
>> this is unexpected. Of course, we can also wait for a while before making a
>> decision.
>>
>>
>>
>> ```
>>
>> Scala version upgrade is blocked by the Ammonite library dev cycle
>> currently.
>>
>> Although we discussed it here and it had good intentions,
>> the current master branch cannot use the latest Scala.
>>
>> - https://lists.apache.org/thread/4nk5ddtmlobdt8g3z8xbqjclzkhlsdfk
>> 
>> "Ammonite as REPL for Spark Connect"
>>  SPARK-42884 Add Ammonite REPL integration
>>
>> Specifically, the following are blocked and I'm monitoring the
>> Ammonite repository.
>> - SPARK-40497 Upgrade Scala to 2.13.11
>> - SPARK-43832 Upgrade Scala to 2.12.18
>> - According to https://github.com/com-lihaoyi/Ammonite/issues,
>>   Scala 3.3.0 LTS support also looks infeasible.
>>
>> Although we may be able to wait for a while, there are two
>> fundamental solutions
>> to unblock this situation in a long-term maintenance perspective.
>> - Replace it with a Scala-shell based implementation
>> - Move `connector/connect/client/jvm/pom.xml` outside from Spark repo.
>>Maybe, we can put it into the new repo like Rust and Go client.
>>
>> ```
>>
>> *From:* Grisha Weintraub 
>> *Date:* Thursday, June 8, 2023 04:05
>> *To:* Dongjoon Hyun 
>> *Cc:* Nan Zhu , Sean Owen , "
>> dev@spark.apache.org" 
>> *Subject:* Re: ASF policy violation and Scala version issues
>>
>>
>>
>> Dongjoon,
>>
>>
>>
>> I followed the conversation, and in my opinion, your concern is totally
>> legit.
>> It just feels that the discussion is focused solely on Databricks, and as
>> I said above, the same issue occurs in other vendors as well.
>>
>>
>>
>>
>>
>> On Wed, Jun 7, 2023 at 10:28 PM Dongjoon Hyun 
>> wrote:
>>
>> To Grisha, we are talking about what is the right way and how to comply
>> with ASF legal advice which I shared in this thread from "legal-discuss@"
>> mailing thread.
>>
>>
>>
>> https://lists.apache.org/thread/mzhggd0rpz8t4d7vdsbhkp38mvd3lty4 (legal-discuss@)
>>
>> https://www.apache.org/foundation/marks/downstream.html#source (ASF
>> 

Re: JDK version support policy?

2023-06-07 Thread Jungtaek Lim
+1 to drop Java 8 but +1 to set the lowest support version to Java 11.

Considering the phase with only security updates, 11 LTS will not be EOLed for a
very long time. Unless that's coupled with other deps which require
bumping the JDK version (I hope someone can bring up a list), dropping it
doesn't seem to buy much. And given the strong backward compatibility JDK
provides, that's less likely.

Purely from the project's source-code view, does anyone know how much
benefit we can leverage by picking up 17 rather than 11? I lost
track, but some of their proposals are mostly about catching up with other
languages, which doesn't make us much happier since Scala has provided them for
years.

On Thu, Jun 8, 2023 at 2:35 AM, Sean Owen  wrote:

> I also generally perceive that, after Java 9, there is much less breaking
> change. So working on Java 11 probably means it works on 20, or can be
> easily made to without pain. Like I think the tweaks for Java 17 were quite
> small.
>
> Targeting Java >11 excludes Java 11 users and probably wouldn't buy much.
> Keeping the support probably doesn't interfere with working on much newer
> JVMs either.
>
> On Wed, Jun 7, 2023, 12:29 PM Holden Karau  wrote:
>
>> So JDK 11 is still supported in open JDK until 2026, I'm not sure if
>> we're going to see enough folks moving to JRE17 by the Spark 4 release
>> unless we have a strong benefit from dropping 11 support I'd be inclined to
>> keep it.
>>
>> On Tue, Jun 6, 2023 at 9:08 PM Dongjoon Hyun  wrote:
>>
>>> I'm also +1 on dropping both Java 8 and 11 in Apache Spark 4.0, too.
>>>
>>> Dongjoon.
>>>
>>> On 2023/06/07 02:42:19 yangjie01 wrote:
>>> > +1 on dropping Java 8 in Spark 4.0, and I even hope Spark 4.0 can only
>>> support Java 17 and the upcoming Java 21.
>>> >
>>> > From: Denny Lee 
>>> > Date: Wednesday, June 7, 2023 07:10
>>> > To: Sean Owen 
>>> > Cc: David Li , "dev@spark.apache.org" <
>>> dev@spark.apache.org>
>>> > Subject: Re: JDK version support policy?
>>> >
>>> > +1 on dropping Java 8 in Spark 4.0, saying this as a fan of the
>>> fast-paced (positive) updates to Arrow, eh?!
>>> >
>>> > On Tue, Jun 6, 2023 at 4:02 PM Sean Owen  wrote:
>>> > I haven't followed this discussion closely, but I think we
>>> could/should drop Java 8 in Spark 4.0, which is up next after 3.5?
>>> >
>>> > On Tue, Jun 6, 2023 at 2:44 PM David Li  wrote:
>>> > Hello Spark developers,
>>> >
>>> > I'm from the Apache Arrow project. We've discussed Java version
>>> support [1], and crucially, whether to continue supporting Java 8 or not.
>>> As Spark is a big user of Arrow in Java, I was curious what Spark's policy
>>> here was.
>>> >
>>> > If Spark intends to stay on Java 8, for instance, we may also want to
>>> stay on Java 8 or otherwise provide some supported version of Arrow for
>>> Java 8.
>>> >
>>> > We've seen dependencies dropping or planning to drop support. gRPC may
>>> drop Java 8 at any time [2], possibly this September [3], which may affect
>>> Spark (due to Spark Connect). And today we saw that Arrow had issues
>>> running tests with Mockito on Java 20, but we couldn't update Mockito since
>>> it had dropped Java 8 support. (We pinned the JDK version in that CI
>>> pipeline for now.)
>>> >
>>> > So at least, I am curious if Arrow could start the long process of
>>> migrating Java versions without impacting Spark, or if we should continue
>>> to cooperate. Arrow Java doesn't see quite so much activity these days, so
>>> it's not quite critical, but it's possible that these dependency issues
>>> will start to affect us more soon. And looking forward, Java is working on
>>> APIs that should also allow us to ditch the --add-opens flag requirement
>>> too.
>>> >
>>> > [1]: https://lists.apache.org/thread/phpgpydtt3yrgnncdyv4qdq1gf02s0yj
>>> > [2]:
>>> https://github.com/grpc/proposal/blob/master/P5-jdk-version-support.md
>>> > [3]: https://github.com/grpc/grpc-java/issues/9386
>>> >
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>


Re: Apache Spark 3.5.0 Expectations (?)

2023-05-29 Thread Jungtaek Lim
Shall we initiate a new discussion thread for making Scala 2.13 the default?
While I'm not an expert in this area, it sounds like the change is major and
(probably) breaking. It seems worth having a separate
discussion thread rather than just treating it as one of 25 items.

On Tue, May 30, 2023 at 9:54 AM Sean Owen  wrote:

> It does seem risky; there are still likely libs out there that don't cross
> compile for 2.13. I would make it the default at 4.0, myself.
>
> On Mon, May 29, 2023 at 7:16 PM Hyukjin Kwon  wrote:
>
>> While I support going forward with a higher version, actually using Scala
>> 2.13 by default is a big deal especially in a way that:
>>
>>- Users would likely download the built-in version assuming that it’s
>>backward binary compatible.
>>- PyPI doesn't allow specifying the Scala version, meaning that users
>>wouldn’t have a way to 'pip install pyspark' based on Scala 2.12.
>>
>> I wonder if it’s safer to do it in Spark 4 (which I believe will be
>> discussed soon).
>>
>>
>> On Mon, 29 May 2023 at 13:21, Jia Fan  wrote:
>>
>>> Thanks Dongjoon!
>>> There are some ticket I want to share.
>>> SPARK-39420 Support ANALYZE TABLE on v2 tables
>>> SPARK-42750 Support INSERT INTO by name
>>> SPARK-43521 Support CREATE TABLE LIKE FILE
>>>
 Dongjoon Hyun  wrote on Mon, May 29, 2023 at 08:42:
>>>
 Hi, All.

 Apache Spark 3.5.0 is scheduled for August (1st Release Candidate) and
 currently a few notable things are under discussion on the mailing list.

 I believe it's a good time to share a short summary list (containing
 both completed and in-progress items) to give a highlight in advance and to
 collect your targets too.

 Please share your expectations or working items if you want to
 prioritize them more in the community in Apache Spark 3.5.0 timeframe.

 (Sorted by ID)
 SPARK-40497 Upgrade Scala 2.13.11
 SPARK-42452 Remove hadoop-2 profile from Apache Spark 3.5.0
 SPARK-42913 Upgrade to Hadoop 3.3.5 (aws-java-sdk-bundle: 1.12.262 ->
 1.12.316)
 SPARK-43024 Upgrade Pandas to 2.0.0
 SPARK-43200 Remove Hadoop 2 reference in docs
 SPARK-43347 Remove Python 3.7 Support
 SPARK-43348 Support Python 3.8 in PyPy3
 SPARK-43351 Add Spark Connect Go prototype code and example
 SPARK-43379 Deprecate old Java 8 versions prior to 8u371
 SPARK-43394 Upgrade to Maven 3.8.8
 SPARK-43436 Upgrade to RocksDbjni 8.1.1.1
 SPARK-43446 Upgrade to Apache Arrow 12.0.0
 SPARK-43447 Support R 4.3.0
 SPARK-43489 Remove protobuf 2.5.0
 SPARK-43519 Bump Parquet to 1.13.1
 SPARK-43581 Upgrade kubernetes-client to 6.6.2
 SPARK-43588 Upgrade to ASM 9.5
 SPARK-43600 Update K8s doc to recommend K8s 1.24+
 SPARK-43738 Upgrade to DropWizard Metrics 4.2.18
 SPARK-43831 Build and Run Spark on Java 21
 SPARK-43832 Upgrade to Scala 2.12.18
 SPARK-43836 Make Scala 2.13 as default in Spark 3.5
 SPARK-43842 Upgrade gcs-connector to 2.2.14
 SPARK-43844 Update to ORC 1.9.0
 UMBRELLA: Add SQL functions into Scala, Python and R API

 Thanks,
 Dongjoon.

 PS. The above is not a list of release blockers. Instead, it could be a
 nice-to-have from someone's perspective.

>>>


Re: Parametrisable output metadata path

2023-04-17 Thread Jungtaek Lim
A small correction: it should read "I intentionally didn't enumerate." The
meaning could be quite different, so I'm making a small correction.
On Tue, Apr 18, 2023 at 5:38 AM Jungtaek Lim 
wrote:

> There seems to be miscommunication - I didn't mean "Delta Lake". I meant
> "any" Data Lake products. Since I'm biased I didn't intentionally enumerate
> actual products, but there are "Apache Hudi", "Apache Iceberg", etc as well.
>
> We made non-trivial numbers of band-aid fixes already for file stream
> sink. For example,
>
> https://github.com/apache/spark/pull/28363
> https://github.com/apache/spark/pull/28904
> https://github.com/apache/spark/pull/29505
> https://github.com/apache/spark/pull/31638
>
> There were many push backs, because these fixes do not solve the real
> problem. The consensus was that we don't want to come up with another Data
> Lake product which requires us to put months (or maybe years) of effort.
> Now, these Data Lake products are backed by companies and they are
> successful projects as individuals. I'm not sure I can be supportive with
> the effort on another band-aid fix.
>
> Maintaining metadata directory is a root of the headache. Unless we see
> the benefit of removing the metadata directory (hence at-least-once) and
> plan to deal with that, I'd like to leave file stream sink as it is.
>
> On Mon, Apr 17, 2023 at 7:37 PM Wojciech Indyk 
> wrote:
>
>> Hi Jungtaek,
>> integration with Delta Lake is not an option to me, I raised a PR for
>> improvement of FileStreamSink with the new parameter:
>> https://github.com/apache/spark/pull/40821. Can you please take a look?
>>
>> --
>> Kind regards/ Pozdrawiam,
>> Wojciech Indyk
>>
>>
>> niedz., 16 kwi 2023 o 04:45 Jungtaek Lim 
>> napisał(a):
>>
>>> Hi,
>>>
>>> We have seen lots of issues reported with the current FileStream
>>> sink. The effort to fix these issues is quite significant, and it ended up
>>> with the derivation of "Data Lake" products.
>>>
>>> I'd recommend not to fix the issue but leave it as its limitation, and
>>> integrate your workload with Data Lake products. For a full disclaimer, I
>>> work in Databricks so I might be biased, but even when I was working at the
>>> previous employer which didn't have the Data Lake product at that time, I
>>> also had to agree that there are too many things to fix, and the effort
>>> would be fully redundant with existing products.
>>>
>>> Maybe, it might be helpful to have an "at-least-once" version of
>>> FileStream sink, where a metadata directory is no longer needed. It may
>>> require the implementation to go back to the old way of atomic renaming,
>>> but it will also get rid of the necessity of a metadata directory, so
>>> someone might find it useful. For end-to-end exactly once, people can
>>> either use a limited current FileStream sink or use Data Lake products. I
>>> don't see the value in making improvements to the current FileStream sink.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> On Sun, Apr 16, 2023 at 2:52 AM Wojciech Indyk 
>>> wrote:
>>>
>>>> Hi!
>>>> I raised a ticket on parametrisable output metadata path
>>>> https://issues.apache.org/jira/browse/SPARK-43152.
>>>> I am going to raise a PR against it, and I realised that this
>>>> relatively simple change impacts the method hasMetadata(path), which would
>>>> have a new meaning if we can define a custom path for the metadata of output
>>>> files. Can you please share your opinion on how the custom output metadata
>>>> path can impact the design of structured streaming?
>>>> E.g. I can see one case where I set the output metadata path parameter,
>>>> run a job on output path A, stop the job, change the output path to B, and
>>>> hasMetadata still works well. If you have any corner case in mind where the
>>>> parametrised output metadata path can break something, please describe it.
>>>>
>>>> --
>>>> Kind regards/ Pozdrawiam,
>>>> Wojciech Indyk
>>>>
>>>


Re: Parametrisable output metadata path

2023-04-17 Thread Jungtaek Lim
There seems to be miscommunication - I didn't mean "Delta Lake". I meant
"any" Data Lake product. Since I'm biased, I intentionally didn't enumerate
actual products, but there are "Apache Hudi", "Apache Iceberg", etc. as well.

We made non-trivial numbers of band-aid fixes already for file stream sink.
For example,

https://github.com/apache/spark/pull/28363
https://github.com/apache/spark/pull/28904
https://github.com/apache/spark/pull/29505
https://github.com/apache/spark/pull/31638

There were many push backs, because these fixes do not solve the real
problem. The consensus was that we don't want to come up with another Data
Lake product which requires us to put months (or maybe years) of effort.
Now, these Data Lake products are backed by companies and they are
successful projects in their own right. I'm not sure I can be supportive of
the effort on another band-aid fix.

Maintaining the metadata directory is the root of the headache. Unless we see
the benefit of removing the metadata directory (hence at-least-once) and plan
to deal with that, I'd like to leave the file stream sink as it is.
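For context, the metadata in question lives in a `_spark_metadata` directory under the sink's output path, and readers decide whether a directory is sink output by checking for it. A rough sketch of that check (illustrative Python, not Spark's actual Scala implementation; the `_spark_metadata` name matches the real sink's default, but the function shape here is an assumption):

```python
import os
import tempfile


def has_metadata(output_path: str, metadata_dir: str = "_spark_metadata") -> bool:
    # Readers treat the directory as file-sink output - and trust only the
    # files listed in the metadata log - when this directory exists.
    return os.path.isdir(os.path.join(output_path, metadata_dir))


# Usage sketch: a fresh output path has no metadata until a sink writes it.
with tempfile.TemporaryDirectory() as out:
    print(has_metadata(out))                         # False
    os.makedirs(os.path.join(out, "_spark_metadata"))
    print(has_metadata(out))                         # True
```

This is also why a parameterisable metadata location changes the check's meaning: a reader inspecting only the output path can no longer tell whether metadata exists somewhere else.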

On Mon, Apr 17, 2023 at 7:37 PM Wojciech Indyk 
wrote:

> Hi Jungtaek,
> integration with Delta Lake is not an option to me, I raised a PR for
> improvement of FileStreamSink with the new parameter:
> https://github.com/apache/spark/pull/40821. Can you please take a look?
>
> --
> Kind regards/ Pozdrawiam,
> Wojciech Indyk
>
>
> niedz., 16 kwi 2023 o 04:45 Jungtaek Lim 
> napisał(a):
>
>> Hi,
>>
>> Lots of issues have been reported with the current FileStream sink. The
>> effort to fix these issues is quite significant, and it ended up with the
>> derivation of "Data Lake" products.
>>
>> I'd recommend not fixing the issue but leaving it as a known limitation, and
>> integrating your workload with Data Lake products. For a full disclaimer, I
>> work at Databricks so I might be biased, but even when I was working at my
>> previous employer, which didn't have a Data Lake product at that time, I
>> also had to agree that there are too many things to fix, and the effort
>> would be fully redundant with existing products.
>>
>> Maybe it would be helpful to have an "at-least-once" version of the
>> FileStream sink, where a metadata directory is no longer needed. It may
>> require the implementation to go back to the old way of atomic renaming,
>> but it will also get rid of the necessity of a metadata directory, so
>> someone might find it useful. For end-to-end exactly-once, people can
>> either use the limited current FileStream sink or use Data Lake products. I
>> don't see the value in making improvements to the current FileStream sink.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> On Sun, Apr 16, 2023 at 2:52 AM Wojciech Indyk 
>> wrote:
>>
>>> Hi!
>>> I raised a ticket on the parametrisable output metadata path:
>>> https://issues.apache.org/jira/browse/SPARK-43152.
>>> I am going to raise a PR against it, and I realised that this relatively
>>> simple change impacts the method hasMetadata(path), which would have a new
>>> meaning if we can define a custom path for metadata of output files. Can you
>>> please share your opinion on how the custom output metadata path can
>>> impact the design of structured streaming?
>>> E.g. I can see one case: I set a parameter for the output metadata path,
>>> run a job on output path A, stop the job, change the output path to B, and
>>> hasMetadata still works well. If you have any corner case in mind where the
>>> parametrised output metadata path could break something, please describe it.
>>>
>>> --
>>> Kind regards/ Pozdrawiam,
>>> Wojciech Indyk
>>>
>>
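[Editor's note] For illustration, the "old way of atomic renaming" mentioned in this thread can be sketched as below. This is a conceptual Python sketch under stated assumptions, not Spark's actual implementation (Spark targets Hadoop-compatible filesystems, where rename atomicity varies, and object stores generally don't rename atomically):

```python
import os
import tempfile

def write_file_atomically(output_dir: str, name: str, data: bytes) -> str:
    """Write to a temp file in the target directory, then rename into place.

    A crashed task leaves only a stray ".tmp-*" file behind, so readers
    never observe partially written output. A retried task, however, may
    commit the same logical data again under a different file name --
    hence "at-least-once" semantics, with no metadata directory needed.
    """
    os.makedirs(output_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=output_dir, prefix=".tmp-")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    final_path = os.path.join(output_dir, name)
    os.rename(tmp_path, final_path)  # atomic on POSIX filesystems
    return final_path
```

Readers of such a sink would then simply list the directory (ignoring dot-prefixed temp files) instead of consulting a metadata log to decide which files are committed.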


Re: Parametrisable output metadata path

2023-04-15 Thread Jungtaek Lim
Hi,

Lots of issues have been reported with the current FileStream sink. The
effort to fix these issues is quite significant, and it ended up with the
derivation of "Data Lake" products.

I'd recommend not fixing the issue but leaving it as a known limitation, and
integrating your workload with Data Lake products. For a full disclaimer, I
work at Databricks so I might be biased, but even when I was working at my
previous employer, which didn't have a Data Lake product at that time, I
also had to agree that there are too many things to fix, and the effort
would be fully redundant with existing products.

Maybe it would be helpful to have an "at-least-once" version of the FileStream
sink, where a metadata directory is no longer needed. It may require the
implementation to go back to the old way of atomic renaming, but it will
also get rid of the necessity of a metadata directory, so someone might
find it useful. For end-to-end exactly-once, people can either use the
limited current FileStream sink or use Data Lake products. I don't see the
value in making improvements to the current FileStream sink.

Thanks,
Jungtaek Lim (HeartSaVioR)

On Sun, Apr 16, 2023 at 2:52 AM Wojciech Indyk 
wrote:

> Hi!
> I raised a ticket on the parametrisable output metadata path:
> https://issues.apache.org/jira/browse/SPARK-43152.
> I am going to raise a PR against it, and I realised that this relatively
> simple change impacts the method hasMetadata(path), which would have a new
> meaning if we can define a custom path for metadata of output files. Can you
> please share your opinion on how the custom output metadata path can
> impact the design of structured streaming?
> E.g. I can see one case: I set a parameter for the output metadata path,
> run a job on output path A, stop the job, change the output path to B, and
> hasMetadata still works well. If you have any corner case in mind where the
> parametrised output metadata path could break something, please describe it.
>
> --
> Kind regards/ Pozdrawiam,
> Wojciech Indyk
>
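[Editor's note] For context on the hasMetadata(path) concern above: the file stream sink marks its output by writing a `_spark_metadata` subdirectory under the output path, and batch readers use its presence to decide whether to trust the sink's log. Below is a minimal conceptual sketch of that check, extended with a configurable metadata location in the spirit of the proposal - the `metadata_path` parameter is a hypothetical illustration, not the actual API:

```python
import os
from typing import Optional

SINK_METADATA_DIR = "_spark_metadata"  # fixed subdirectory name used by the sink today

def has_metadata(output_path: str, metadata_path: Optional[str] = None) -> bool:
    """Sketch of the hasMetadata(path) check.

    Today the metadata location is derived from the output path itself,
    so the check is self-contained. With a user-supplied metadata_path
    (hypothetical), the answer also depends on external configuration:
    a reader that only knows output path B cannot tell on its own
    whether B was produced by the sink if B's metadata lives elsewhere.
    """
    candidate = metadata_path or os.path.join(output_path, SINK_METADATA_DIR)
    return os.path.isdir(candidate)
```

This is the kind of corner case the question is probing: with an external metadata directory, switching the output path from A to B while reusing the old metadata location would make the check answer "yes" for B even though the log may only track files written under A.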


Re: [VOTE] Release Apache Spark 3.2.4 (RC1)

2023-04-11 Thread Jungtaek Lim
+1 (non-binding)

Thanks for driving the release!

On Wed, Apr 12, 2023 at 3:41 AM Xinrong Meng 
wrote:

> +1 non-binding
>
> Thank you Dongjoon!
>
> On Mon, Apr 10, 2023 at 11:32 PM Wenchen Fan wrote:
>
>> +1
>>
>> On Tue, Apr 11, 2023 at 10:09 AM Hyukjin Kwon 
>> wrote:
>>
>>> +1
>>>
>>> On Tue, 11 Apr 2023 at 11:04, Ruifeng Zheng 
>>> wrote:
>>>
 +1 (non-binding)

 Thank you for driving this release!

 --
 Ruifeng  Zheng
 ruife...@foxmail.com

 



 -- Original --
 *From:* "Yuming Wang" ;
 *Date:* Tue, Apr 11, 2023 09:56 AM
 *To:* "Mridul Muralidharan";
 *Cc:* "huaxin gao";"Chao Sun"<
 sunc...@apache.org>;"yangjie01";"Dongjoon Hyun"<
 dongj...@apache.org>;"Sean Owen";"
 dev@spark.apache.org";
 *Subject:* Re: [VOTE] Release Apache Spark 3.2.4 (RC1)

 +1.

 On Tue, Apr 11, 2023 at 12:17 AM Mridul Muralidharan 
 wrote:

> +1
>
> Signatures, digests, etc check out fine.
> Checked out tag and build/tested with -Phive -Pyarn -Pmesos
> -Pkubernetes
>
> Regards,
> Mridul
>
>
> On Mon, Apr 10, 2023 at 10:34 AM huaxin gao 
> wrote:
>
>> +1
>>
>> On Mon, Apr 10, 2023 at 8:17 AM Chao Sun  wrote:
>>
>>> +1 (non-binding)
>>>
>>> On Mon, Apr 10, 2023 at 7:07 AM yangjie01 
>>> wrote:
>>>
 +1 (non-binding)



 *From:* Sean Owen 
 *Date:* Monday, April 10, 2023, 21:19
 *To:* Dongjoon Hyun 
 *Cc:* "dev@spark.apache.org" 
 *Subject:* Re: [VOTE] Release Apache Spark 3.2.4 (RC1)



 +1 from me



 On Sun, Apr 9, 2023 at 7:19 PM Dongjoon Hyun 
 wrote:

 I'll start with my +1.

 I verified the checksum, signatures of the artifacts, and
 documentations.
 Also, ran the tests with YARN and K8s modules.

 Dongjoon.

 On 2023/04/09 23:46:10 Dongjoon Hyun wrote:
 > Please vote on releasing the following candidate as Apache Spark
 version
 > 3.2.4.
 >
 > The vote is open until April 13th 1AM (PST) and passes if a
 majority +1 PMC
 > votes are cast, with a minimum of 3 +1 votes.
 >
 > [ ] +1 Release this package as Apache Spark 3.2.4
 > [ ] -1 Do not release this package because ...
 >
 > To learn more about Apache Spark, please see
 https://spark.apache.org/
 
 >
 > The tag to be voted on is v3.2.4-rc1 (commit
 > 0ae10ac18298d1792828f1d59b652ef17462d76e)
 > https://github.com/apache/spark/tree/v3.2.4-rc1
 
 >
 > The release files, including signatures, digests, etc. can be
 found at:
 > https://dist.apache.org/repos/dist/dev/spark/v3.2.4-rc1-bin/
 
 >
 > Signatures used for Spark RCs can be found in this file:
 > https://dist.apache.org/repos/dist/dev/spark/KEYS
 
 >
 > The staging repository for this release can be found at:
 >
 https://repository.apache.org/content/repositories/orgapachespark-1442/
 
 >
 > The documentation corresponding to this release can be found at:
 > https://dist.apache.org/repos/dist/dev/spark/v3.2.4-rc1-docs/
 
 >
 > The list of bug fixes going into 3.2.4 can be found at the
 following URL:
 > https://issues.apache.org/jira/projects/SPARK/versions/12352607
 
 >
 > This release is using the release script of the tag v3.2.4-rc1.
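[Editor's note] For readers unfamiliar with the verification steps voters mention in this thread: the published `.sha512` files let anyone recompute an artifact's digest and compare it to the published one. A self-contained Python sketch of that comparison (a local stand-in file would be used in place of the real release artifact):

```python
import hashlib

def sha512_hex(path: str) -> str:
    """Stream a file through SHA-512, the digest used for the published .sha512 files."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def checksum_matches(artifact_path: str, published_hex: str) -> bool:
    """True when the recomputed digest equals the published one."""
    return sha512_hex(artifact_path) == published_hex.strip().lower()
```

Signature verification is the other half of the process: import the KEYS file from dist.apache.org with gpg and verify each artifact's `.asc` signature against it.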
 

Re: [VOTE] Release Apache Spark 3.4.0 (RC7)

2023-04-11 Thread Jungtaek Lim
+1 (non-binding)

Thanks for driving the release!

On Wed, Apr 12, 2023 at 10:42 AM Ye Zhou  wrote:

> +1 non-binding
>
> On Tue, Apr 11, 2023 at 18:40 Ye Zhou  wrote:
>
>> Yes, it is not a regression issue. We can fix it after the release.
>>
>> Thanks
>> Ye
>>
>> On Tue, Apr 11, 2023 at 17:42 Xiao Li  wrote:
>>
>>> Thanks for testing it in your environment!
>>>
>>>
 This is a minor issue itself, and only impacts the metrics for
 push-based shuffle, but it essentially nullifies the effort
 in SPARK-36620.
>>>
>>>
>>> Based on my understanding, this is not a regression. It only affects the
>>> new enhancements https://issues.apache.org/jira/browse/SPARK-36620 If
>>> so, it does not block the release RC.
>>>
>>> We should still fix it in 3.4 and the fix will be available in the next
>>> maintenance releases.
>>>
>>> Xiao
>>>
>>>
>>>
 On Tue, Apr 11, 2023 at 5:14 PM Ye Zhou  wrote:
>>>
 Manually tested the binary in our cluster.
 Started spark-shell application with some shuffle. Found one issue
 which is related to push based shuffle client side metrics introduced in
 https://github.com/apache/spark/pull/36165.
 Filed a ticket https://issues.apache.org/jira/browse/SPARK-43100,
 posted PR there, and verified that the PR fixes the issue.

 This is a minor issue itself, and only impacts the metrics for
 push-based shuffle, but it essentially nullifies the effort
 in SPARK-36620.

 Would like to raise this issue in the voting thread, but hold my
 non-binding -1 here.


 On Tue, Apr 11, 2023 at 1:06 AM Peter Toth 
 wrote:

> +1
>
> On Tue, Apr 11, 2023 at 9:09 AM Jia Fan  wrote:
>
>> +1
>>
>> On Tue, Apr 11, 2023 at 2:32 PM Wenchen Fan  wrote:
>>
>>> +1
>>>
>>> On Tue, Apr 11, 2023 at 9:57 AM Yuming Wang 
>>> wrote:
>>>
 +1.

 On Tue, Apr 11, 2023 at 9:14 AM Yikun Jiang 
 wrote:

> +1 (non-binding)
>
> Also ran the docker image related test (signatures/standalone/k8s)
> with rc7: https://github.com/apache/spark-docker/pull/32
>
> Regards,
> Yikun
>
>
> On Tue, Apr 11, 2023 at 4:44 AM Jacek Laskowski 
> wrote:
>
>> +1
>>
>> * Built fine with Scala 2.13
>> and 
>> -Pkubernetes,hadoop-cloud,hive,hive-thriftserver,scala-2.13,volcano
>> * Ran some demos on Java 17
>> * Mac mini / Apple M2 Pro / Ventura 13.3.1
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> "The Internals Of" Online Books 
>> Follow me on https://twitter.com/jaceklaskowski
>>
>> 
>>
>>
>> On Sat, Apr 8, 2023 at 1:30 AM Xinrong Meng <
>> xinrong.apa...@gmail.com> wrote:
>>
>>> Please vote on releasing the following candidate(RC7) as Apache
>>> Spark version 3.4.0.
>>>
>>> The vote is open until 11:59pm Pacific time *April 12th* and
>>> passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 
>>> votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.4.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see
>>> http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.4.0-rc7 (commit
>>> 87a5442f7ed96b11051d8a9333476d080054e5a0):
>>> https://github.com/apache/spark/tree/v3.4.0-rc7
>>>
>>> The release files, including signatures, digests, etc. can be
>>> found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc7-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>>
>>> https://repository.apache.org/content/repositories/orgapachespark-1441
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.4.0-rc7-docs/
>>>
>>> The list of bug fixes going into 3.4.0 can be found at the
>>> following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12351465
>>>
>>> This release is using the release script of the tag v3.4.0-rc7.
>>>
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>> If you are a Spark user, you can help us test this release by
>>> taking
>>> an existing Spark workload and running on this release
>>> candidate, then

Re: Slack for PySpark users

2023-04-03 Thread Jungtaek Lim
Just to be clear, if there is no strong volunteer to keep the new community
channel active, I'd probably be OK with not forking the channel. You can
see a strong counterexample in the #spark channel in ASF: it is a place
with only questions and promos but zero answers. I see volunteers here
asking for another channel, so I want us to go with the way these
volunteers prefer.

The user mailing list is not in good shape. I hope we give it another try
with more recent technology to see whether we can gain traction - if we
fail, the user mailing list will still be there.

On Tue, Apr 4, 2023 at 7:04 AM Jungtaek Lim 
wrote:

> The number of subscribers isn't a meaningful signal. Please look
> at the number of mails being sent to the list.
>
> https://lists.apache.org/list.html?u...@spark.apache.org
> The latest month in which more than 200 emails were sent was Feb 2022,
> more than a year ago. It was more than 1k in 2016, and more than 2k in 2015
> and earlier.
> Let's face the facts: the user mailing list is dying, even before we start
> discussing alternative communication methods.
>
> Users won't adopt a channel just because PMC members (or ASF) prefer it.
> They go with whatever is convenient for them.
>
> The same applies here - if ASF Slack requires a restricted invitation
> mechanism, then it won't work. It looks like there is a link for an
> invitation, but we are also talking about the cost as well.
> https://cwiki.apache.org/confluence/display/INFRA/Slack+Guest+Invites
> As long as we are serious about the cost, I don't think we will land
> on the option that is convenient for "users".
>
> On Tue, Apr 4, 2023 at 4:59 AM Dongjoon Hyun 
> wrote:
>
>> As Mich Talebzadeh pointed out, Apache Spark has an official Slack
>> channel.
>>
>> > It's unavoidable if "users" prefer to use an alternative communication
>> mechanism rather than the user mailing list.
>>
>> The following is the number of people in the official channels.
>>
>> - u...@spark.apache.org has 4519 subscribers.
>> - dev@spark.apache.org has 3149 subscribers.
>> - ASF Official Slack channel has 602 subscribers.
>>
>> May I ask if the users prefer to use the ASF Official Slack channel
>> than the user mailing list?
>>
>> Dongjoon.
>>
>>
>>
>> On Thu, Mar 30, 2023 at 9:10 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> I'm reading through the page "Briefing: The Apache Way", and in the
>>> section of "Open Communications", restriction of communication inside ASF
>>> INFRA (mailing list) is more about code and decision-making.
>>>
>>> https://www.apache.org/theapacheway/#what-makes-the-apache-way-so-hard-to-define
>>>
>>> It's unavoidable if "users" prefer to use an alternative communication
>>> mechanism rather than the user mailing list. Before Stack Overflow days,
>>> there had been a meaningful number of questions around user@. It's just
>>> impossible to let them go back and post to the user mailing list.
>>>
>>> We just need to make sure it is not the purpose of employing Slack to
>>> move all discussions about developments, direction of the project, etc
>>> which must happen in dev@/private@. The purpose of Slack thread here
>>> does not seem to aim to serve the purpose.
>>>
>>>
>>> On Fri, Mar 31, 2023 at 7:00 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Good discussions and proposals all around.
>>>>
>>>> I have used slack in anger on a customer site before. For small and
>>>> medium size groups it is good and affordable. Alternatives have been
>>>> suggested as well so those who like investigative search can agree and come
>>>> up with a freebie one.
>>>> I am inclined to agree with Bjorn that this slack has more social
>>>> dimensions than the mailing list. It is akin to a sports club using
>>>> WhatsApp groups for communication. Remember we were originally looking for
>>>> space for webinars, including Spark on Linkedin that Denny Lee suggested.
>>>> I think Slack and mailing groups can coexist happily. On a more serious
>>>> note, when I joined the user group back in 2015-2016, there was a lot of
>>>> traffic. Currently we hardly get many mails daily, fewer than 5. So having
>>>> a slack type medium may improve members' participation.
>>>>
>>>> so +1 for me as well.
>>>>
>>>> Mich Tal

Re: Slack for PySpark users

2023-04-03 Thread Jungtaek Lim
The number of subscribers isn't a meaningful signal. Please look
at the number of mails being sent to the list.

https://lists.apache.org/list.html?u...@spark.apache.org
The latest month in which more than 200 emails were sent was Feb 2022,
more than a year ago. It was more than 1k in 2016, and more than 2k in 2015
and earlier.
Let's face the facts: the user mailing list is dying, even before we start
discussing alternative communication methods.

Users won't adopt a channel just because PMC members (or ASF) prefer it.
They go with whatever is convenient for them.

The same applies here - if ASF Slack requires a restricted invitation
mechanism, then it won't work. It looks like there is a link for an
invitation, but we are also talking about the cost as well.
https://cwiki.apache.org/confluence/display/INFRA/Slack+Guest+Invites
As long as we are serious about the cost, I don't think we will land
on the option that is convenient for "users".

On Tue, Apr 4, 2023 at 4:59 AM Dongjoon Hyun 
wrote:

> As Mich Talebzadeh pointed out, Apache Spark has an official Slack channel.
>
> > It's unavoidable if "users" prefer to use an alternative communication
> mechanism rather than the user mailing list.
>
> The following is the number of people in the official channels.
>
> - u...@spark.apache.org has 4519 subscribers.
> - dev@spark.apache.org has 3149 subscribers.
> - ASF Official Slack channel has 602 subscribers.
>
> May I ask if the users prefer to use the ASF Official Slack channel
> than the user mailing list?
>
> Dongjoon.
>
>
>
> On Thu, Mar 30, 2023 at 9:10 PM Jungtaek Lim 
> wrote:
>
>> I'm reading through the page "Briefing: The Apache Way", and in the
>> section of "Open Communications", restriction of communication inside ASF
>> INFRA (mailing list) is more about code and decision-making.
>>
>> https://www.apache.org/theapacheway/#what-makes-the-apache-way-so-hard-to-define
>>
>> It's unavoidable if "users" prefer to use an alternative communication
>> mechanism rather than the user mailing list. Before Stack Overflow days,
>> there had been a meaningful number of questions around user@. It's just
>> impossible to let them go back and post to the user mailing list.
>>
>> We just need to make sure it is not the purpose of employing Slack to
>> move all discussions about developments, direction of the project, etc
>> which must happen in dev@/private@. The purpose of Slack thread here
>> does not seem to aim to serve the purpose.
>>
>>
>> On Fri, Mar 31, 2023 at 7:00 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Good discussions and proposals all around.
>>>
>>> I have used slack in anger on a customer site before. For small and
>>> medium size groups it is good and affordable. Alternatives have been
>>> suggested as well so those who like investigative search can agree and come
>>> up with a freebie one.
>>> I am inclined to agree with Bjorn that this slack has more social
>>> dimensions than the mailing list. It is akin to a sports club using
>>> WhatsApp groups for communication. Remember we were originally looking for
>>> space for webinars, including Spark on Linkedin that Denny Lee suggested.
>>> I think Slack and mailing groups can coexist happily. On a more serious
>>> note, when I joined the user group back in 2015-2016, there was a lot of
>>> traffic. Currently we hardly get many mails daily, fewer than 5. So having
>>> a slack type medium may improve members' participation.
>>>
>>> so +1 for me as well.
>>>
>>> Mich Talebzadeh,
>>> Lead Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Thu, 30 Mar 2023 at 22:19, Denny Lee  wrote:
>>>
>>>> +1.
>>>>
>>>> To Shani’s point, there are multiple OSS projects that use the free
>>>> Slack ver

Re: Slack for PySpark users

2023-03-30 Thread Jungtaek Lim
I'm reading through the page "Briefing: The Apache Way", and in the section
of "Open Communications", restriction of communication inside ASF INFRA
(mailing list) is more about code and decision-making.
https://www.apache.org/theapacheway/#what-makes-the-apache-way-so-hard-to-define

It's unavoidable if "users" prefer to use an alternative communication
mechanism rather than the user mailing list. Before Stack Overflow days,
there had been a meaningful number of questions around user@. It's just
impossible to let them go back and post to the user mailing list.

We just need to make sure it is not the purpose of employing Slack to move
all discussions about developments, direction of the project, etc which
must happen in dev@/private@. The purpose of Slack thread here does not
seem to aim to serve the purpose.


On Fri, Mar 31, 2023 at 7:00 AM Mich Talebzadeh 
wrote:

> Good discussions and proposals all around.
>
> I have used slack in anger on a customer site before. For small and medium
> size groups it is good and affordable. Alternatives have been suggested as
> well so those who like investigative search can agree and come up with a
> freebie one.
> I am inclined to agree with Bjorn that this slack has more social
> dimensions than the mailing list. It is akin to a sports club using
> WhatsApp groups for communication. Remember we were originally looking for
> space for webinars, including Spark on Linkedin that Denny Lee suggested.
> I think Slack and mailing groups can coexist happily. On a more serious
> note, when I joined the user group back in 2015-2016, there was a lot of
> traffic. Currently we hardly get many mails daily, fewer than 5. So having
> a slack type medium may improve members' participation.
>
> so +1 for me as well.
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 30 Mar 2023 at 22:19, Denny Lee  wrote:
>
>> +1.
>>
>> To Shani’s point, there are multiple OSS projects that use the free Slack
>> version - top of mind include Delta, Presto, Flink, Trino, Datahub, MLflow,
>> etc.
>>
>> On Thu, Mar 30, 2023 at 14:15  wrote:
>>
>>> Hey everyone,
>>>
>>> I think we should remain on the free plan in Slack.
>>>
>>> In my opinion the free plan is more than enough; the only downside is
>>> we can only see the last 90 days of messages.
>>>
>>> From what I know, the Airflow community (which has a strong, active
>>> community in Slack) also uses the free plan (you can tell by the 90-day
>>> limit notice in their workspace).
>>>
>>> You can find the pricing and feature comparison between the Slack
>>> plans here.
>>>
>>> Have a great day,
>>> Shani
>>>
>>> On 30 Mar 2023, at 23:38, Mridul Muralidharan  wrote:
>>>
>>> 
>>>
>>>
>>> Thanks for flagging the concern Dongjoon, I was not aware of the
>>> discussion - but I can understand the concern.
>>> Would be great if you or Matei could update the thread on the result of
>>> deliberations, once it reaches a logical consensus: before we set up
>>> official policy around it.
>>>
>>> Regards,
>>> Mridul
>>>
>>>
>>> On Thu, Mar 30, 2023 at 4:23 PM Bjørn Jørgensen <
>>> bjornjorgen...@gmail.com> wrote:
>>>
 I like the idea of having a talk channel. It can make it easier for
 everyone to say hello. Or to dare to ask about small or big matters that
 you would not have dared to ask about before on mailing lists.
 But then there is the price and what is the best for an open source
 project.

 The price for using Slack is high.
 Right now, for those that have joined the Spark Slack:
 $8.75 USD per member per month x 72 members = $630 USD per month

 https://app.slack.com/plans/T04URTRBZ1R/checkout/form?entry_point=hero_banner_upgrade_cta=2

 And Slack does not have an option for open source projects.

 There seem to be some alternatives for open source software. I have
 not tried them.
 Like https://www.rocket.chat/blog/slack-open-source-alternatives

 


 rocket chat is open source https://github.com/RocketChat/Rocket.Chat

 On Thu, Mar 30, 2023 at 6:54 PM Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Hi Dongjoon
>
> to your points if I may
>
> - Do you have any reference from other official ASF-related Slack
> channels?
>No, I don't have any reference from other official ASF-related
> Slack channels because I don't think 

Re: [DISCUSS] Deprecate DStream in 3.4

2023-01-20 Thread Jungtaek Lim
Heads-up: It's addressed via
https://issues.apache.org/jira/browse/SPARK-42075. We just marked
the deprecation in the entry point of DStream, StreamingContext. Marking all
classes in the DStream module is not pragmatic and users would see the
warning message anyway.
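[Editor's note] As a rough illustration of what "marking deprecation in the entry point" amounts to - a Python analogue for illustration only, not the actual change made in SPARK-42075 (the class and message below are hypothetical stand-ins):

```python
import functools
import warnings

def deprecated(message: str):
    """Emit a DeprecationWarning whenever the decorated callable is invoked."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(message, DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)
        return wrapper
    return decorator

# Hypothetical stand-in for the real entry point: marking only the
# constructor is enough, since every DStream is built through the context.
class LegacyStreamingContext:
    @deprecated("DStream is deprecated; migrate to Structured Streaming.")
    def __init__(self, app_name: str):
        self.app_name = app_name
```

Users then see the warning once, at the point where they construct the context, without every class in the module carrying its own annotation.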

On Mon, Jan 16, 2023 at 8:26 AM Jungtaek Lim 
wrote:

> Given that I got positive votes from more than 3 PMC members as well as
> several active contributors, I will proceed with the actual work.
> (It may take a couple more days as folks in the US will help me, and there's
> a holiday in the US.)
>
> Please let me know if we want to have an official vote thread before
> moving forward.
>
> Thanks all for providing your voices on this!
>
> On Sat, Jan 14, 2023 at 3:56 AM Anish Shrigondekar <
> anish.shrigonde...@databricks.com> wrote:
>
>> +1 on the Dstreams deprecation proposal
>>
>> On Fri, Jan 13, 2023 at 10:47 AM Jerry Peng 
>> wrote:
>>
>>> +1 in general for marking the DStreams API as deprecated
>>>
>>> Jungtaek, can you please provide / elaborate on the concrete actions you
>>> intend to take for the deprecation process?
>>>
>>> Best,
>>>
>>> Jerry
>>>
>>> On Thu, Jan 12, 2023 at 11:16 PM L. C. Hsieh  wrote:
>>>
>>>> +1
>>>>
>>>> On Thu, Jan 12, 2023 at 10:39 PM Jungtaek Lim
>>>>  wrote:
>>>> >
>>>> > Yes, exactly. I'm sorry to bring confusion - should have clarified
>>>> action items on the proposal.
>>>> >
>>>> > On Fri, Jan 13, 2023 at 3:31 PM Dongjoon Hyun <
>>>> dongjoon.h...@gmail.com> wrote:
>>>> >>
>>>> >> Then, could you elaborate `the proposed code change` specifically?
>>>> >> Maybe, usual deprecation warning logs and annotation on the API?
>>>> >>
>>>> >>
>>>> >> On Thu, Jan 12, 2023 at 10:05 PM Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
>>>> >>>
>>>> >>> Maybe I need to clarify - my proposal is "explicitly" deprecating
>>>> it, which incurs code change for sure. Guidance on the Spark website is
>>>> done already as I mentioned - we updated the DStream doc page to mention
>>>> that DStream is a "legacy" project and users should move to SS. I don't
>>>> feel this is sufficient to refrain users from using it, hence initiating
>>>> this proposal.
>>>> >>>
>>>> >>> Sorry to make confusion. I just wanted to make sure the goal of the
>>>> proposal is not "removing" the API. The discussion on the removal of API
>>>> doesn't tend to go well, so I wanted to make sure I don't mean that.
>>>> >>>
>>>> >>> On Fri, Jan 13, 2023 at 2:46 PM Dongjoon Hyun <
>>>> dongjoon.h...@gmail.com> wrote:
>>>> >>>>
>>>> >>>> +1 for the proposal (guiding only without any code change).
>>>> >>>>
>>>> >>>> Thanks,
>>>> >>>> Dongjoon.
>>>> >>>>
>>>> >>>> On Thu, Jan 12, 2023 at 9:33 PM Shixiong Zhu 
>>>> wrote:
>>>> >>>>>
>>>> >>>>> +1
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> On Thu, Jan 12, 2023 at 5:08 PM Tathagata Das <
>>>> tathagata.das1...@gmail.com> wrote:
>>>> >>>>>>
>>>> >>>>>> +1
>>>> >>>>>>
>>>> >>>>>> On Thu, Jan 12, 2023 at 7:46 PM Hyukjin Kwon <
>>>> gurwls...@gmail.com> wrote:
>>>> >>>>>>>
>>>> >>>>>>> +1
>>>> >>>>>>>
>>>> >>>>>>> On Fri, 13 Jan 2023 at 08:51, Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
>>>> >>>>>>>>
>>>> >>>>>>>> bump for more visibility.
>>>> >>>>>>>>
>>>> >>>>>>>> On Wed, Jan 11, 2023 at 12:20 PM Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
>>>> >>>>>>>>>
>>>> >>>>>>>>> Hi dev,
>>>> >>>>>>>>>
>>>> >>>>>>>>> I'd like to propose the deprecation of DStream in Spark 3.4,
>>>> in favor of promoting Structured Streaming.
>>>> >>>>>>>>> (Sorry for the late proposal, if we don't make the change in
>>>> 3.4, we will have to wait for another 6 months.)
>>>> >>>>>>>>>
>>>> >>>>>>>>> We have been focusing on Structured Streaming for years
>>>> (across multiple major and minor versions), and during the time we haven't
>>>> made any improvements for DStream. Furthermore, recently we updated the
>>>> DStream doc to explicitly say DStream is a legacy project.
>>>> >>>>>>>>>
>>>> https://spark.apache.org/docs/latest/streaming-programming-guide.html#note
>>>> >>>>>>>>>
>>>> >>>>>>>>> The baseline of deprecation is that we don't see a particular
>>>> use case which only DStream solves. This is a different story with GraphX
>>>> and MLLIB, as we don't have replacements for that.
>>>> >>>>>>>>>
>>>> >>>>>>>>> The proposal does not mean we will remove the API soon, as
>>>> the Spark project has been making deprecation against public API. I don't
>>>> intend to propose the target version for removal. The goal is to guide
>>>> users to refrain from constructing a new workload with DStream. We might
>>>> want to go with this in future, but it would require a new discussion
>>>> thread at that time.
>>>> >>>>>>>>>
>>>> >>>>>>>>> What do you think?
>>>> >>>>>>>>>
>>>> >>>>>>>>> Thanks,
>>>> >>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>> -
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
>>>>


Re: Time for Spark 3.4.0 release?

2023-01-17 Thread Jungtaek Lim
+1 on delaying. I see there's a JIRA ticket about DStream deprecation; we
are working on this - thanks for taking this into account!

On Wed, Jan 18, 2023 at 12:43 PM Hyukjin Kwon wrote:

> +1. Thanks for driving this, Xinrong.
>
> On Wed, 18 Jan 2023 at 12:31, Xinrong Meng 
> wrote:
>
>> Hi All,
>>
>> Considering there are still important issues unresolved (some are as
>> shown below), I would suggest to be conservative, we delay the branch-3.4's
>> cut for one week.
>>
>> https://issues.apache.org/jira/browse/SPARK-39375
>> https://issues.apache.org/jira/browse/SPARK-41589
>> https://issues.apache.org/jira/browse/SPARK-42075
>> https://issues.apache.org/jira/browse/SPARK-25299
>> https://issues.apache.org/jira/browse/SPARK-41053
>>
>> I plan to cut *branch-3.4* at *18:30 PT, January 24, 2023*. Please
>> ensure your changes for Apache Spark 3.4 to be ready by that time.
>>
>> Feel free to reply to the email if you have other ongoing big items for
>> Spark 3.4.
>>
>> Thanks,
>>
>> Xinrong Meng
>>
>> On Sat, Jan 7, 2023 at 9:16 AM Hyukjin Kwon  wrote:
>>
>>> Thanks Xinrong.
>>>
>>> On Sat, Jan 7, 2023 at 9:18 AM Xinrong Meng 
>>> wrote:
>>>
 The release window for Apache Spark 3.4.0 is updated per
 https://github.com/apache/spark-website/pull/430.

 Thank you all!

 On Thu, Jan 5, 2023 at 2:10 PM Maxim Gekk 
 wrote:

> +1
>
> On Thu, Jan 5, 2023 at 12:25 AM huaxin gao 
> wrote:
>
>> +1 Thanks!
>>
>> On Wed, Jan 4, 2023 at 10:19 AM L. C. Hsieh  wrote:
>>
>>> +1
>>>
>>> Thank you!
>>>
>>> On Wed, Jan 4, 2023 at 9:13 AM Chao Sun  wrote:
>>>
 +1, thanks!

 Chao

 On Wed, Jan 4, 2023 at 1:56 AM Mridul Muralidharan <
 mri...@gmail.com> wrote:

>
> +1, Thanks !
>
> Regards,
> Mridul
>
> On Wed, Jan 4, 2023 at 2:20 AM Gengliang Wang 
> wrote:
>
>> +1, thanks for driving the release!
>>
>>
>> Gengliang
>>
>> On Tue, Jan 3, 2023 at 10:55 PM Dongjoon Hyun <
>> dongjoon.h...@gmail.com> wrote:
>>
>>> +1
>>>
>>> Thank you!
>>>
>>> Dongjoon
>>>
>>> On Tue, Jan 3, 2023 at 9:44 PM Rui Wang 
>>> wrote:
>>>
 +1 to cut the branch starting from a workday!

 Great to see this is happening!

 Thanks Xinrong!

 -Rui

 On Tue, Jan 3, 2023 at 9:21 PM 416161...@qq.com <
 ruife...@foxmail.com> wrote:

> +1, thank you Xinrong for driving this release!
>
> --
> Ruifeng Zheng
> ruife...@foxmail.com
>
> 
>
>
>
> -- Original --
> *From:* "Hyukjin Kwon" ;
> *Date:* Wed, Jan 4, 2023 01:15 PM
> *To:* "Xinrong Meng";
> *Cc:* "dev";
> *Subject:* Re: Time for Spark 3.4.0 release?
>
> SGTM +1
>
> On Wed, Jan 4, 2023 at 2:13 PM Xinrong Meng <
> xinrong.apa...@gmail.com> wrote:
>
>> Hi All,
>>
>> Shall we cut *branch-3.4* on *January 16th, 2023*? We
>> proposed January 15th per
>> https://spark.apache.org/versioning-policy.html, but I would
>> suggest we postpone one day since January 15th is a Sunday.
>>
>> I would like to volunteer as the release manager for *Apache
>> Spark 3.4.0*.
>>
>> Thanks,
>>
>> Xinrong Meng
>>
>>


Re: [DISCUSS] Deprecate DStream in 3.4

2023-01-15 Thread Jungtaek Lim
Given that I got more than 3 PMC members' positive votes as well as several
active contributors' positive votes as well, I will proceed with the actual
work.
(It may take a couple of more days as folk in US will help me and there's a
holiday in US.)

Please let me know if we want to have an official vote thread before moving
forward.

Thanks all for providing your voices on this!

On Sat, Jan 14, 2023 at 3:56 AM Anish Shrigondekar <
anish.shrigonde...@databricks.com> wrote:

> +1 on the Dstreams deprecation proposal
>
> On Fri, Jan 13, 2023 at 10:47 AM Jerry Peng 
> wrote:
>
>> +1 in general for marking the DStreams API as deprecated
>>
>> Jungtaek, can you please provide / elaborate on the concrete actions you
>> intend to take for the deprecation process?
>>
>> Best,
>>
>> Jerry
>>
>> On Thu, Jan 12, 2023 at 11:16 PM L. C. Hsieh  wrote:
>>
>>> +1
>>>
>>> On Thu, Jan 12, 2023 at 10:39 PM Jungtaek Lim
>>>  wrote:
>>> >
>>> > Yes, exactly. I'm sorry to bring confusion - should have clarified
>>> action items on the proposal.
>>> >
>>> > On Fri, Jan 13, 2023 at 3:31 PM Dongjoon Hyun 
>>> wrote:
>>> >>
>>> >> Then, could you elaborate `the proposed code change` specifically?
>>> >> Maybe, usual deprecation warning logs and annotation on the API?
>>> >>
>>> >>
>>> >> On Thu, Jan 12, 2023 at 10:05 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>> >>>
>>> >>> Maybe I need to clarify - my proposal is "explicitly" deprecating
>>> it, which incurs code change for sure. Guidance on the Spark website is
>>> done already as I mentioned - we updated the DStream doc page to mention
>>> that DStream is a "legacy" project and users should move to SS. I don't
>>> feel this is sufficient to refrain users from using it, hence initiating
>>> this proposal.
>>> >>>
>>> >>> Sorry to make confusion. I just wanted to make sure the goal of the
>>> proposal is not "removing" the API. The discussion on the removal of API
>>> doesn't tend to go well, so I wanted to make sure I don't mean that.
>>> >>>
>>> >>> On Fri, Jan 13, 2023 at 2:46 PM Dongjoon Hyun <
>>> dongjoon.h...@gmail.com> wrote:
>>> >>>>
>>> >>>> +1 for the proposal (guiding only without any code change).
>>> >>>>
>>> >>>> Thanks,
>>> >>>> Dongjoon.
>>> >>>>
>>> >>>> On Thu, Jan 12, 2023 at 9:33 PM Shixiong Zhu 
>>> wrote:
>>> >>>>>
>>> >>>>> +1
>>> >>>>>
>>> >>>>>
>>> >>>>> On Thu, Jan 12, 2023 at 5:08 PM Tathagata Das <
>>> tathagata.das1...@gmail.com> wrote:
>>> >>>>>>
>>> >>>>>> +1
>>> >>>>>>
>>> >>>>>> On Thu, Jan 12, 2023 at 7:46 PM Hyukjin Kwon 
>>> wrote:
>>> >>>>>>>
>>> >>>>>>> +1
>>> >>>>>>>
>>> >>>>>>> On Fri, 13 Jan 2023 at 08:51, Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>> >>>>>>>>
>>> >>>>>>>> bump for more visibility.
>>> >>>>>>>>
>>> >>>>>>>> On Wed, Jan 11, 2023 at 12:20 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>> >>>>>>>>>
>>> >>>>>>>>> Hi dev,
>>> >>>>>>>>>
>>> >>>>>>>>> I'd like to propose the deprecation of DStream in Spark 3.4,
>>> in favor of promoting Structured Streaming.
>>> >>>>>>>>> (Sorry for the late proposal, if we don't make the change in
>>> 3.4, we will have to wait for another 6 months.)
>>> >>>>>>>>>
>>> >>>>>>>>> We have been focusing on Structured Streaming for years
>>> (across multiple major and minor versions), and during the time we haven't
>>> made any improvements for DStream. Furthermore, recently we updated the
>>> DStream doc to explicitly say DStream is a legacy project.
>>> >>>>>>>>>
>>> https://spark.apache.org/docs/latest/streaming-programming-guide.html#note
>>> >>>>>>>>>
>>> >>>>>>>>> The baseline of deprecation is that we don't see a particular
>>> use case which only DStream solves. This is a different story with GraphX
>>> and MLLIB, as we don't have replacements for that.
>>> >>>>>>>>>
>>> >>>>>>>>> The proposal does not mean we will remove the API soon, as the
>>> Spark project has been making deprecation against public API. I don't
>>> intend to propose the target version for removal. The goal is to guide
>>> users to refrain from constructing a new workload with DStream. We might
>>> want to go with this in future, but it would require a new discussion
>>> thread at that time.
>>> >>>>>>>>>
>>> >>>>>>>>> What do you think?
>>> >>>>>>>>>
>>> >>>>>>>>> Thanks,
>>> >>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Re: [DISCUSS] Deprecate DStream in 3.4

2023-01-15 Thread Jungtaek Lim
I described it in the thread - since I had to add it in a reply, it's not
easy to find. Sorry for the inconvenience.

https://lists.apache.org/thread/d9yg7w9pnb9rw7c2yglp4qk6jt43y0kw


On Sat, Jan 14, 2023 at 3:46 AM Jerry Peng 
wrote:

> +1 in general for marking the DStreams API as deprecated
>
> Jungtaek, can you please provide / elaborate on the concrete actions you
> intend to take for the deprecation process?
>
> Best,
>
> Jerry
>
> On Thu, Jan 12, 2023 at 11:16 PM L. C. Hsieh  wrote:
>
>> +1
>>
>> On Thu, Jan 12, 2023 at 10:39 PM Jungtaek Lim
>>  wrote:
>> >
>> > Yes, exactly. I'm sorry to bring confusion - should have clarified
>> action items on the proposal.
>> >
>> > On Fri, Jan 13, 2023 at 3:31 PM Dongjoon Hyun 
>> wrote:
>> >>
>> >> Then, could you elaborate `the proposed code change` specifically?
>> >> Maybe, usual deprecation warning logs and annotation on the API?
>> >>
>> >>
>> >> On Thu, Jan 12, 2023 at 10:05 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>> >>>
>> >>> Maybe I need to clarify - my proposal is "explicitly" deprecating it,
>> which incurs code change for sure. Guidance on the Spark website is done
>> already as I mentioned - we updated the DStream doc page to mention that
>> DStream is a "legacy" project and users should move to SS. I don't feel
>> this is sufficient to refrain users from using it, hence initiating this
>> proposal.
>> >>>
>> >>> Sorry to make confusion. I just wanted to make sure the goal of the
>> proposal is not "removing" the API. The discussion on the removal of API
>> doesn't tend to go well, so I wanted to make sure I don't mean that.
>> >>>
>> >>> On Fri, Jan 13, 2023 at 2:46 PM Dongjoon Hyun <
>> dongjoon.h...@gmail.com> wrote:
>> >>>>
>> >>>> +1 for the proposal (guiding only without any code change).
>> >>>>
>> >>>> Thanks,
>> >>>> Dongjoon.
>> >>>>
>> >>>> On Thu, Jan 12, 2023 at 9:33 PM Shixiong Zhu 
>> wrote:
>> >>>>>
>> >>>>> +1
>> >>>>>
>> >>>>>
>> >>>>> On Thu, Jan 12, 2023 at 5:08 PM Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>> >>>>>>
>> >>>>>> +1
>> >>>>>>
>> >>>>>> On Thu, Jan 12, 2023 at 7:46 PM Hyukjin Kwon 
>> wrote:
>> >>>>>>>
>> >>>>>>> +1
>> >>>>>>>
>> >>>>>>> On Fri, 13 Jan 2023 at 08:51, Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>> >>>>>>>>
>> >>>>>>>> bump for more visibility.
>> >>>>>>>>
>> >>>>>>>> On Wed, Jan 11, 2023 at 12:20 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>> >>>>>>>>>
>> >>>>>>>>> Hi dev,
>> >>>>>>>>>
>> >>>>>>>>> I'd like to propose the deprecation of DStream in Spark 3.4, in
>> favor of promoting Structured Streaming.
>> >>>>>>>>> (Sorry for the late proposal, if we don't make the change in
>> 3.4, we will have to wait for another 6 months.)
>> >>>>>>>>>
>> >>>>>>>>> We have been focusing on Structured Streaming for years (across
>> multiple major and minor versions), and during the time we haven't made any
>> improvements for DStream. Furthermore, recently we updated the DStream doc
>> to explicitly say DStream is a legacy project.
>> >>>>>>>>>
>> https://spark.apache.org/docs/latest/streaming-programming-guide.html#note
>> >>>>>>>>>
>> >>>>>>>>> The baseline of deprecation is that we don't see a particular
>> use case which only DStream solves. This is a different story with GraphX
>> and MLLIB, as we don't have replacements for that.
>> >>>>>>>>>
>> >>>>>>>>> The proposal does not mean we will remove the API soon, as the
>> Spark project has been making deprecation against public API. I don't
>> intend to propose the target version for removal. The goal is to guide
>> users to refrain from constructing a new workload with DStream. We might
>> want to go with this in future, but it would require a new discussion
>> thread at that time.
>> >>>>>>>>>
>> >>>>>>>>> What do you think?
>> >>>>>>>>>
>> >>>>>>>>> Thanks,
>> >>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [DISCUSS] Deprecate DStream in 3.4

2023-01-12 Thread Jungtaek Lim
Yes, exactly. I'm sorry for the confusion - I should have clarified the
action items in the proposal.

On Fri, Jan 13, 2023 at 3:31 PM Dongjoon Hyun 
wrote:

> Then, could you elaborate `the proposed code change` specifically?
> Maybe, usual deprecation warning logs and annotation on the API?
>
>
> On Thu, Jan 12, 2023 at 10:05 PM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> Maybe I need to clarify - my proposal is "explicitly" deprecating it,
>> which incurs code change for sure. Guidance on the Spark website is done
>> already as I mentioned - we updated the DStream doc page to mention that
>> DStream is a "legacy" project and users should move to SS. I don't feel
>> this is sufficient to refrain users from using it, hence initiating
>> this proposal.
>>
>> Sorry to make confusion. I just wanted to make sure the goal of the
>> proposal is not "removing" the API. The discussion on the removal of API
>> doesn't tend to go well, so I wanted to make sure I don't mean that.
>>
>> On Fri, Jan 13, 2023 at 2:46 PM Dongjoon Hyun 
>> wrote:
>>
>>> +1 for the proposal (guiding only without any code change).
>>>
>>> Thanks,
>>> Dongjoon.
>>>
>>> On Thu, Jan 12, 2023 at 9:33 PM Shixiong Zhu  wrote:
>>>
>>>> +1
>>>>
>>>>
>>>> On Thu, Jan 12, 2023 at 5:08 PM Tathagata Das <
>>>> tathagata.das1...@gmail.com> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Thu, Jan 12, 2023 at 7:46 PM Hyukjin Kwon 
>>>>> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> On Fri, 13 Jan 2023 at 08:51, Jungtaek Lim <
>>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>>
>>>>>>> bump for more visibility.
>>>>>>>
>>>>>>> On Wed, Jan 11, 2023 at 12:20 PM Jungtaek Lim <
>>>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi dev,
>>>>>>>>
>>>>>>>> I'd like to propose the deprecation of DStream in Spark 3.4, in
>>>>>>>> favor of promoting Structured Streaming.
>>>>>>>> (Sorry for the late proposal, if we don't make the change in 3.4,
>>>>>>>> we will have to wait for another 6 months.)
>>>>>>>>
>>>>>>>> We have been focusing on Structured Streaming for years (across
>>>>>>>> multiple major and minor versions), and during the time we haven't 
>>>>>>>> made any
>>>>>>>> improvements for DStream. Furthermore, recently we updated the DStream 
>>>>>>>> doc
>>>>>>>> to explicitly say DStream is a legacy project.
>>>>>>>>
>>>>>>>> https://spark.apache.org/docs/latest/streaming-programming-guide.html#note
>>>>>>>>
>>>>>>>> The baseline of deprecation is that we don't see a particular use
>>>>>>>> case which only DStream solves. This is a different story with GraphX 
>>>>>>>> and
>>>>>>>> MLLIB, as we don't have replacements for that.
>>>>>>>>
>>>>>>>> The proposal does not mean we will remove the API soon, as the
>>>>>>>> Spark project has been making deprecation against public API. I don't
>>>>>>>> intend to propose the target version for removal. The goal is to guide
>>>>>>>> users to refrain from constructing a new workload with DStream. We 
>>>>>>>> might
>>>>>>>> want to go with this in future, but it would require a new discussion
>>>>>>>> thread at that time.
>>>>>>>>
>>>>>>>> What do you think?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>>
>>>>>>>


Re: [DISCUSS] Deprecate DStream in 3.4

2023-01-12 Thread Jungtaek Lim
There might be terminology differences, so let me spell out the action
items from the proposal explicitly:

- Add "deprecation" annotation to the user-facing public API in streaming
directory (DStream)
- Write a release note to explicitly mention the deprecation. (Maybe
promote again that they are encouraged to move to SS.)

This is not an action item from the proposal:

- Add (tentative) target version to remove the API on the deprecation
message.

Hope this makes the proposal crystal clear.
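
To illustrate what the first action item amounts to in practice, here is a minimal, hypothetical sketch (not actual Spark code - the class name and message are made up for illustration) of how an explicit deprecation typically surfaces to users: the legacy entry point keeps working, but constructing it emits a deprecation warning.

```python
import warnings

class LegacyStreamingContext:
    """Stand-in for a user-facing legacy API (illustrative only)."""
    def __init__(self):
        # The deprecation annotation in practice: warn, but keep working.
        warnings.warn(
            "DStream is deprecated as of Spark 3.4.0; "
            "migrate to Structured Streaming.",
            DeprecationWarning,
            stacklevel=2,
        )

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    LegacyStreamingContext()  # triggers the warning; existing code still runs

print(len(caught))                     # 1
print(caught[0].category.__name__)     # DeprecationWarning
```

In Scala the equivalent would be the standard `@deprecated` annotation on the public classes, which makes the compiler emit warnings for user code referencing them.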

On Fri, Jan 13, 2023 at 3:05 PM Jungtaek Lim 
wrote:

> Maybe I need to clarify - my proposal is "explicitly" deprecating it,
> which incurs code change for sure. Guidance on the Spark website is done
> already as I mentioned - we updated the DStream doc page to mention that
> DStream is a "legacy" project and users should move to SS. I don't feel
> this is sufficient to refrain users from using it, hence initiating
> this proposal.
>
> Sorry to make confusion. I just wanted to make sure the goal of the
> proposal is not "removing" the API. The discussion on the removal of API
> doesn't tend to go well, so I wanted to make sure I don't mean that.
>
> On Fri, Jan 13, 2023 at 2:46 PM Dongjoon Hyun 
> wrote:
>
>> +1 for the proposal (guiding only without any code change).
>>
>> Thanks,
>> Dongjoon.
>>
>> On Thu, Jan 12, 2023 at 9:33 PM Shixiong Zhu  wrote:
>>
>>> +1
>>>
>>>
>>> On Thu, Jan 12, 2023 at 5:08 PM Tathagata Das <
>>> tathagata.das1...@gmail.com> wrote:
>>>
>>>> +1
>>>>
>>>> On Thu, Jan 12, 2023 at 7:46 PM Hyukjin Kwon 
>>>> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Fri, 13 Jan 2023 at 08:51, Jungtaek Lim <
>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> bump for more visibility.
>>>>>>
>>>>>> On Wed, Jan 11, 2023 at 12:20 PM Jungtaek Lim <
>>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi dev,
>>>>>>>
>>>>>>> I'd like to propose the deprecation of DStream in Spark 3.4, in
>>>>>>> favor of promoting Structured Streaming.
>>>>>>> (Sorry for the late proposal, if we don't make the change in 3.4, we
>>>>>>> will have to wait for another 6 months.)
>>>>>>>
>>>>>>> We have been focusing on Structured Streaming for years (across
>>>>>>> multiple major and minor versions), and during the time we haven't made 
>>>>>>> any
>>>>>>> improvements for DStream. Furthermore, recently we updated the DStream 
>>>>>>> doc
>>>>>>> to explicitly say DStream is a legacy project.
>>>>>>>
>>>>>>> https://spark.apache.org/docs/latest/streaming-programming-guide.html#note
>>>>>>>
>>>>>>> The baseline of deprecation is that we don't see a particular use
>>>>>>> case which only DStream solves. This is a different story with GraphX 
>>>>>>> and
>>>>>>> MLLIB, as we don't have replacements for that.
>>>>>>>
>>>>>>> The proposal does not mean we will remove the API soon, as the Spark
>>>>>>> project has been making deprecation against public API. I don't intend 
>>>>>>> to
>>>>>>> propose the target version for removal. The goal is to guide users to
>>>>>>> refrain from constructing a new workload with DStream. We might want to 
>>>>>>> go
>>>>>>> with this in future, but it would require a new discussion thread at 
>>>>>>> that
>>>>>>> time.
>>>>>>>
>>>>>>> What do you think?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>
>>>>>>


Re: [DISCUSS] Deprecate DStream in 3.4

2023-01-12 Thread Jungtaek Lim
Maybe I need to clarify - my proposal is "explicitly" deprecating it, which
incurs a code change for sure. Guidance on the Spark website is done already,
as I mentioned - we updated the DStream doc page to mention that DStream is
a "legacy" project and users should move to SS. I don't feel this is
sufficient to deter users from using it, hence initiating this proposal.

Sorry for the confusion. I just wanted to make it clear that the goal of the
proposal is not "removing" the API. Discussions on API removal tend not to
go well, so I wanted to make sure I don't mean that.

On Fri, Jan 13, 2023 at 2:46 PM Dongjoon Hyun 
wrote:

> +1 for the proposal (guiding only without any code change).
>
> Thanks,
> Dongjoon.
>
> On Thu, Jan 12, 2023 at 9:33 PM Shixiong Zhu  wrote:
>
>> +1
>>
>>
>> On Thu, Jan 12, 2023 at 5:08 PM Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>>> +1
>>>
>>> On Thu, Jan 12, 2023 at 7:46 PM Hyukjin Kwon 
>>> wrote:
>>>
>>>> +1
>>>>
>>>> On Fri, 13 Jan 2023 at 08:51, Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> bump for more visibility.
>>>>>
>>>>> On Wed, Jan 11, 2023 at 12:20 PM Jungtaek Lim <
>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> Hi dev,
>>>>>>
>>>>>> I'd like to propose the deprecation of DStream in Spark 3.4, in favor
>>>>>> of promoting Structured Streaming.
>>>>>> (Sorry for the late proposal, if we don't make the change in 3.4, we
>>>>>> will have to wait for another 6 months.)
>>>>>>
>>>>>> We have been focusing on Structured Streaming for years (across
>>>>>> multiple major and minor versions), and during the time we haven't made 
>>>>>> any
>>>>>> improvements for DStream. Furthermore, recently we updated the DStream 
>>>>>> doc
>>>>>> to explicitly say DStream is a legacy project.
>>>>>>
>>>>>> https://spark.apache.org/docs/latest/streaming-programming-guide.html#note
>>>>>>
>>>>>> The baseline of deprecation is that we don't see a particular use
>>>>>> case which only DStream solves. This is a different story with GraphX and
>>>>>> MLLIB, as we don't have replacements for that.
>>>>>>
>>>>>> The proposal does not mean we will remove the API soon, as the Spark
>>>>>> project has been making deprecation against public API. I don't intend to
>>>>>> propose the target version for removal. The goal is to guide users to
>>>>>> refrain from constructing a new workload with DStream. We might want to 
>>>>>> go
>>>>>> with this in future, but it would require a new discussion thread at that
>>>>>> time.
>>>>>>
>>>>>> What do you think?
>>>>>>
>>>>>> Thanks,
>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>
>>>>>


Re: [DISCUSS] Deprecate DStream in 3.4

2023-01-12 Thread Jungtaek Lim
bump for more visibility.

On Wed, Jan 11, 2023 at 12:20 PM Jungtaek Lim 
wrote:

> Hi dev,
>
> I'd like to propose the deprecation of DStream in Spark 3.4, in favor of
> promoting Structured Streaming.
> (Sorry for the late proposal, if we don't make the change in 3.4, we will
> have to wait for another 6 months.)
>
> We have been focusing on Structured Streaming for years (across multiple
> major and minor versions), and during the time we haven't made any
> improvements for DStream. Furthermore, recently we updated the DStream doc
> to explicitly say DStream is a legacy project.
> https://spark.apache.org/docs/latest/streaming-programming-guide.html#note
>
> The baseline of deprecation is that we don't see a particular use case
> which only DStream solves. This is a different story with GraphX and MLLIB,
> as we don't have replacements for that.
>
> The proposal does not mean we will remove the API soon, as the Spark
> project has been making deprecation against public API. I don't intend to
> propose the target version for removal. The goal is to guide users to
> refrain from constructing a new workload with DStream. We might want to go
> with this in future, but it would require a new discussion thread at that
> time.
>
> What do you think?
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>


[DISCUSS] Deprecate DStream in 3.4

2023-01-10 Thread Jungtaek Lim
Hi dev,

I'd like to propose the deprecation of DStream in Spark 3.4, in favor of
promoting Structured Streaming.
(Sorry for the late proposal, if we don't make the change in 3.4, we will
have to wait for another 6 months.)

We have been focusing on Structured Streaming for years (across multiple
major and minor versions), and during the time we haven't made any
improvements for DStream. Furthermore, recently we updated the DStream doc
to explicitly say DStream is a legacy project.
https://spark.apache.org/docs/latest/streaming-programming-guide.html#note

The baseline of deprecation is that we don't see a particular use case
which only DStream solves. This is a different story with GraphX and MLLIB,
as we don't have replacements for that.

The proposal does not mean we will remove the API soon, as the Spark
project has been making deprecation against public API. I don't intend to
propose the target version for removal. The goal is to guide users to
refrain from constructing a new workload with DStream. We might want to go
with this in future, but it would require a new discussion thread at that
time.

What do you think?

Thanks,
Jungtaek Lim (HeartSaVioR)


[VOTE][RESULT][SPIP] Asynchronous Offset Management in Structured Streaming

2022-12-04 Thread Jungtaek Lim
The vote passes with 7 +1s (5 binding +1s).
Thanks to all who reviewed the SPIP doc and voted!

(* = binding)
+1:
- Jungtaek Lim
- Xingbo Jiang
- Mridul Muralidharan (*)
- Hyukjin Kwon (*)
- Shixiong Zhu (*)
- Wenchen Fan (*)
- Dongjoon Hyun (*)

+0: None

-1: None

Thanks,
Jungtaek Lim (HeartSaVioR)


Re: [VOTE][SPIP] Asynchronous Offset Management in Structured Streaming

2022-11-30 Thread Jungtaek Lim
Starting with +1 from me.

On Thu, Dec 1, 2022 at 10:54 AM Jungtaek Lim 
wrote:

> Hi all,
>
> I'd like to start the vote for SPIP: Asynchronous Offset Management in
> Structured Streaming.
>
> The high level summary of the SPIP is that we propose a couple of
> improvements on offset management in microbatch execution to lower down
> processing latency, which would help for certain types of workloads.
>
> References:
>
>- JIRA ticket <https://issues.apache.org/jira/browse/SPARK-39591>
>- SPIP doc
>
> <https://docs.google.com/document/d/1iPiI4YoGCM0i61pBjkxcggU57gHKf2jVwD7HWMHgH-Y/edit?usp=sharing>
>- Discussion thread
><https://lists.apache.org/thread/yv8ffr56prjr16qh12lwjyjl1q8dl7lp>
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Thanks!
> Jungtaek Lim (HeartSaVioR)
>


[VOTE][SPIP] Asynchronous Offset Management in Structured Streaming

2022-11-30 Thread Jungtaek Lim
Hi all,

I'd like to start the vote for SPIP: Asynchronous Offset Management in
Structured Streaming.

The high-level summary of the SPIP is that we propose a couple of
improvements to offset management in microbatch execution to lower
processing latency, which would help certain types of workloads.
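
The core idea - decoupling offset log writes from the batch processing loop so that batch N+1 does not wait on batch N's checkpoint I/O - can be sketched as a toy illustration. Everything here is hypothetical (names like `offset_committer` are made up, and a list append stands in for the offset log write); it shows the general shape of the idea only, not the actual Spark implementation.

```python
import queue
import threading

committed = []                 # stand-in for the durable offset log
commit_queue = queue.Queue()   # pending offset-commit requests

def offset_committer():
    # Background thread draining offset commits in batch order, off the
    # critical path of micro-batch processing.
    while True:
        batch_id = commit_queue.get()
        if batch_id is None:   # sentinel to stop the thread
            break
        committed.append(batch_id)  # "write" the offset log entry
        commit_queue.task_done()

worker = threading.Thread(target=offset_committer, daemon=True)
worker.start()

def run_batch(batch_id):
    # Process the micro-batch (stateless work), then enqueue the offset
    # commit instead of performing it synchronously before the next batch.
    result = f"batch-{batch_id}-processed"
    commit_queue.put(batch_id)
    return result

results = [run_batch(i) for i in range(5)]
commit_queue.join()      # in this sketch, wait for all pending commits
commit_queue.put(None)   # stop the committer
worker.join()

print(results[-1])   # batch-4-processed
print(committed)     # [0, 1, 2, 3, 4]
```

The latency win in the real proposal comes from the processing loop never blocking on the offset write; the trade-off is that on failure, recovery must tolerate offsets that were processed but not yet committed.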

References:

   - JIRA ticket <https://issues.apache.org/jira/browse/SPARK-39591>
   - SPIP doc
   
<https://docs.google.com/document/d/1iPiI4YoGCM0i61pBjkxcggU57gHKf2jVwD7HWMHgH-Y/edit?usp=sharing>
   - Discussion thread
   <https://lists.apache.org/thread/yv8ffr56prjr16qh12lwjyjl1q8dl7lp>

Please vote on the SPIP for the next 72 hours:

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …

Thanks!
Jungtaek Lim (HeartSaVioR)


Re: [DISCUSSION] SPIP: Asynchronous Offset Management in Structured Streaming

2022-11-30 Thread Jungtaek Lim
Thanks all for the support! Great to see us driving the discussion for
Structured Streaming and getting sufficient support.

We would like to move forward with the vote thread. Please also participate
in the vote. Thanks again!

On Thu, Dec 1, 2022 at 10:04 AM Wenchen Fan  wrote:

> +1 to improve the widely used micro-batch mode first.
>
> On Thu, Dec 1, 2022 at 8:49 AM Hyukjin Kwon  wrote:
>
>> +1
>>
>> On Thu, 1 Dec 2022 at 08:10, Shixiong Zhu  wrote:
>>
>>> +1
>>>
>>> This is exciting. I agree with Jerry that this SPIP and continuous
>>> processing are orthogonal. This SPIP itself would be a great improvement
>>> and impact most Structured Streaming users.
>>>
>>> Best Regards,
>>> Shixiong
>>>
>>>
>>> On Wed, Nov 30, 2022 at 6:57 AM Mridul Muralidharan 
>>> wrote:
>>>
>>>>
>>>> Thanks for all the clarifications and details Jerry, Jungtaek :-)
>>>> This looks like an exciting improvement to Structured Streaming -
>>>> looking forward to it becoming part of Apache Spark !
>>>>
>>>> Regards,
>>>> Mridul
>>>>
>>>>
>>>> On Mon, Nov 28, 2022 at 8:40 PM Jerry Peng 
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I will add my two cents.  Improving the Microbatch execution engine
>>>>> does not prevent us from working/improving on the continuous execution
>>>>> engine in the future.  These are orthogonal issues.  This new mode I am
>>>>> proposing in the microbatch execution engine intends to lower latency of
>>>>> this execution engine that most people use today.  We can view it as an
>>>>> incremental improvement on the existing engine. I see the continuous
>>>>> execution engine as a partially completed re-write of spark streaming and
>>>>> may serve as the "future" engine powering Spark Streaming.   Improving the
>>>>> "current" engine does not mean we cannot work on a "future" engine.  These
>>>>> two are not mutually exclusive. I would like to focus the discussion on 
>>>>> the
>>>>> merits of this feature in regards to the current micro-batch execution
>>>>> engine and not a discussion on the future of continuous execution engine.
>>>>>
>>>>> Best,
>>>>>
>>>>> Jerry
>>>>>
>>>>>
>>>>> On Wed, Nov 23, 2022 at 3:17 AM Jungtaek Lim <
>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> Hi Mridul,
>>>>>>
>>>>>> I'd like to make clear to avoid any misunderstanding - the decision
>>>>>> was not led by me. (I'm just a one of engineers in the team. Not even 
>>>>>> TL.)
>>>>>> As you see the direction, there was an internal consensus to not revisit
>>>>>> the continuous mode. There are various reasons, which I think we know
>>>>>> already. You seem to remember I have raised concerns about continuous 
>>>>>> mode,
>>>>>> but have you indicated that it was even over 2 years ago? I still see no
>>>>>> traction around the project. The main reason I abandoned the discussion 
>>>>>> was
>>>>>> due to promising effort on integrating push based shuffle into continuous
>>>>>> mode to achieve shuffle, but no effort has been made so far.
>>>>>>
>>>>>> The goal of this SPIP is to have an alternative approach dealing with
>>>>>> same workload, given that we no longer have confidence of success of
>>>>>> continuous mode. But I also want to make clear that deprecating and
>>>>>> eventually retiring continuous mode is not a goal of this project. If 
>>>>>> that
>>>>>> happens eventually, that would be a side-effect. Someone may have 
>>>>>> concerns
>>>>>> that we have two different projects aiming for similar thing, but I'd
>>>>>> rather see both projects having competition. If anyone willing to improve
>>>>>> continuous mode can start making the effort right now. This SPIP does not
>>>>>> block it.
>>>>>>
>>>>>>
>>>>>> On Wed, Nov 23, 2022 at 5:29 PM Mridul Muralidharan 
>>>>>> wrote:
>>>>

Re: [ANNOUNCE] Apache Spark 3.2.3 released

2022-11-30 Thread Jungtaek Lim
Thanks Chao for driving the release!

On Wed, Nov 30, 2022 at 6:03 PM Wenchen Fan  wrote:

> Thanks, Chao!
>
> On Wed, Nov 30, 2022 at 1:33 AM Chao Sun  wrote:
>
>> We are happy to announce the availability of Apache Spark 3.2.3!
>>
>> Spark 3.2.3 is a maintenance release containing stability fixes. This
>> release is based on the branch-3.2 maintenance branch of Spark. We
>> strongly
>> recommend all 3.2 users to upgrade to this stable release.
>>
>> To download Spark 3.2.3, head over to the download page:
>> https://spark.apache.org/downloads.html
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-3-2-3.html
>>
>> We would like to acknowledge all community members for contributing to
>> this
>> release. This release would not have been possible without you.
>>
>> Chao
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [DISCUSSION] SPIP: Asynchronous Offset Management in Structured Streaming

2022-11-23 Thread Jungtaek Lim
Hi Mridul,

I'd like to make this clear to avoid any misunderstanding - the decision was
not led by me. (I'm just one of the engineers on the team, not even the TL.)
As you see from the direction, there was an internal consensus not to revisit
the continuous mode. There are various reasons, which I think we know already.
You seem to remember that I raised concerns about continuous mode, but were
you aware that was over 2 years ago? I still see no traction around the
project. The main reason I abandoned the discussion was the promising effort
on integrating push-based shuffle into continuous mode to achieve shuffle
support, but no such effort has been made so far.

The goal of this SPIP is to have an alternative approach to dealing with the
same workload, given that we no longer have confidence in the success of
continuous mode. But I also want to make clear that deprecating and
eventually retiring continuous mode is not a goal of this project. If that
happens eventually, it would be a side-effect. Someone may have concerns that
we have two different projects aiming for a similar thing, but I'd rather see
the two projects compete. Anyone willing to improve continuous mode can start
making the effort right now. This SPIP does not block it.


On Wed, Nov 23, 2022 at 5:29 PM Mridul Muralidharan 
wrote:

>
> Hi Jungtaek,
>
>   Given the goal of the SPIP is reducing latency for stateless apps, and
> should reasonably fit continuous mode design goals, it feels odd to not
> support it fin the proposal.
>
> I know you have raised concerns about continuous mode in past as well in
> dev@ list, and we are further ignoring it in this proposal (and possibly
> other enhancements in past few releases).
>
> Do you want to revisit the discussion to support it and propose a vote on
> that ? And move it to deprecated ?
>
> I am much more comfortable not supporting this SPIP for CM if it was
> deprecated.
>
> Thoughts ?
>
> Regards,
> Mridul
>
>
>
>
> On Wed, Nov 23, 2022 at 1:16 AM Jerry Peng 
> wrote:
>
>> Jungtaek,
>>
>> Thanks for taking up the role to shepard this SPIP!  Thank you for also
>> chiming in on your thoughts concerning the continuous mode!
>>
>> Best,
>>
>> Jerry
>>
>> On Tue, Nov 22, 2022 at 5:57 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Just FYI, I'm shepherding this SPIP project.
>>>
>>> I think the major meta question would be, "why don't we spend effort on
>>> continuous mode rather than initiating another feature aiming for the
>>> same workload?". Jerry already updated the doc to answer the question, but
>>> I can also share my thoughts about it.
>>>
>>> I feel like the current "continuous mode" is a niche solution. (It's not
>>> to blame. If you have to deal with such workload but can't rewrite the
>>> underlying engine from scratch, then there are really few options.)
>>> Since the implementation went with a workaround to implement which the
>>> architecture does not support natively e.g. distributed snapshot, it gets
>>> quite tricky on maintaining and expanding the project. It also requires 3rd
>>> parties to implement a separate source and sink implementation, which I'm
>>> not sure how many 3rd parties actually followed so far.
>>>
>>> Eventually, "continuous mode" has become an area where no one in the active
>>> community knows the details or is willing to maintain it. I wouldn't say
>>> we are confident enough to remove the "experimental" tag, although the
>>> feature has been shipped for years. It was introduced in Spark 2.3,
>>> surprisingly enough.
>>>
>>> We went back and thought about the approach from scratch. Jerry came up
>>> with an idea that leverages the existing microbatch execution, hence it is
>>> relatively stable and does not require 3rd parties to support another
>>> mode. It adds complexity to the microbatch execution, but it's a lot less
>>> complicated than the existing continuous mode, and definitely far less
>>> than creating a new record-to-record engine from scratch.
>>>
>>> That said, we want to propose and move forward with the new approach.
>>>
>>> ps. Eventually we could probably discuss retiring continuous mode if the
>>> new approach gets accepted and eventually considered as a stable one after
>>> several minor releases. That's just me.
>>>
>>> On Wed, Nov 23, 2022 at 5:16 AM Jerry Peng 
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I would like to start the discussion for a SPIP, Asynchronous Offset
>>>> Management in Structured Streaming.

Re: [DISCUSSION] SPIP: Asynchronous Offset Management in Structured Streaming

2022-11-22 Thread Jungtaek Lim
Just FYI, I'm shepherding this SPIP project.

I think the major meta question would be, "why don't we spend effort on
continuous mode rather than initiating another feature aiming for the
same workload?". Jerry already updated the doc to answer the question, but
I can also share my thoughts about it.

I feel like the current "continuous mode" is a niche solution. (It's not to
blame. If you have to deal with such workload but can't rewrite the
underlying engine from scratch, then there are really few options.)
Since the implementation went with a workaround to implement what the
architecture does not support natively (e.g. distributed snapshots), it gets
quite tricky to maintain and expand the project. It also requires 3rd
parties to provide separate source and sink implementations, and I'm
not sure how many 3rd parties have actually followed so far.

Eventually, "continuous mode" has become an area where no one in the active
community knows the details or is willing to maintain it. I wouldn't say
we are confident enough to remove the "experimental" tag, although the
feature has been shipped for years. It was introduced in Spark 2.3,
surprisingly enough.

We went back and thought about the approach from scratch. Jerry came up
with an idea that leverages the existing microbatch execution, hence it is
relatively stable and does not require 3rd parties to support another
mode. It adds complexity to the microbatch execution, but it's a lot less
complicated than the existing continuous mode, and definitely far less
than creating a new record-to-record engine from scratch.

That said, we want to propose and move forward with the new approach.

ps. Eventually we could probably discuss retiring continuous mode if the
new approach gets accepted and eventually considered as a stable one after
several minor releases. That's just me.

On Wed, Nov 23, 2022 at 5:16 AM Jerry Peng 
wrote:

> Hi all,
>
> I would like to start the discussion for a SPIP, Asynchronous Offset
> Management in Structured Streaming.  The high level summary of the SPIP is
> that currently in Structured Streaming we perform a couple of offset
> management operations for progress tracking purposes synchronously on the
> critical path, which can contribute significantly to processing latency.  If
> we were to make these operations asynchronous and less frequent, we could
> dramatically improve latency for certain types of workloads.
>
> I have put together a SPIP to implement such a mechanism.  Please take a
> look!
>
> SPIP Jira: https://issues.apache.org/jira/browse/SPARK-39591
>
> SPIP doc:
> https://docs.google.com/document/d/1iPiI4YoGCM0i61pBjkxcggU57gHKf2jVwD7HWMHgH-Y/edit?usp=sharing
>
>
> Best,
>
> Jerry
>
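
The latency trade-off described in the SPIP summary above (synchronous offset and progress writes sitting on the critical path versus asynchronous, less frequent ones) can be illustrated with a small, self-contained toy model. This is not Spark code: the class and method names are hypothetical, the "durable write" is a sleep, and a real implementation must also handle failure and replay semantics.

```python
import queue
import threading
import time

class AsyncProgressTracker:
    """Toy model: progress records are enqueued on the processing path and
    durably 'committed' by a background thread, so the slow write never
    blocks batch processing."""

    def __init__(self):
        self._pending = queue.Queue()
        self._stop = threading.Event()
        self.committed = []  # (batch_id, offset) pairs, in commit order
        self._worker = threading.Thread(target=self._commit_loop, daemon=True)
        self._worker.start()

    def _commit_loop(self):
        # Drain remaining records even after stop is requested.
        while not self._stop.is_set() or not self._pending.empty():
            try:
                batch_id, offset = self._pending.get(timeout=0.05)
            except queue.Empty:
                continue
            time.sleep(0.01)  # simulate a slow durable write (e.g. HDFS/S3)
            self.committed.append((batch_id, offset))

    def record(self, batch_id, offset):
        # Enqueue and return immediately: off the critical path.
        self._pending.put((batch_id, offset))

    def close(self):
        self._stop.set()
        self._worker.join()

tracker = AsyncProgressTracker()
for batch_id in range(5):
    tracker.record(batch_id, offset=batch_id * 100)  # no blocking write here
tracker.close()
assert tracker.committed == [(i, i * 100) for i in range(5)]
```

The single worker thread drains a FIFO queue, so commits still land in batch order; the processing loop only pays the cost of an in-memory enqueue.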


Re: [VOTE][SPIP] Better Spark UI scalability and Driver stability for large applications

2022-11-16 Thread Jungtaek Lim
+1

Nice to see the chance for the driver to reduce resource usage and increase
stability, especially given that the driver is a SPOF. It's even promising
to have a future plan to pre-bake the kvstore for SHS from the driver.

Thanks for driving the effort, Gengliang!

On Thu, Nov 17, 2022 at 5:32 AM Chris Nauroth  wrote:

> +1 (non-binding)
>
> Gengliang, thank you for the SPIP.
>
> Chris Nauroth
>
>
> On Wed, Nov 16, 2022 at 4:27 AM Maciej  wrote:
>
>> +1
>>
>> On 11/16/22 13:19, Yuming Wang wrote:
>> > +1, non-binding
>> >
>> > On Wed, Nov 16, 2022 at 8:12 PM Yang,Jie(INF) > > > wrote:
>> >
>> > +1, non-binding
>> >
>> >
>> > Yang Jie
>> >
>> > *From:* Mridul Muralidharan > > >
>> > *Date:* Wednesday, November 16, 2022, 17:35
>> > *To:* Kent Yao mailto:y...@apache.org>>
>> > *Cc:* Gengliang Wang > > >, dev > > >
>> > *Subject:* Re: [VOTE][SPIP] Better Spark UI scalability and Driver
>> > stability for large applications
>> >
>> > +1
>> >
>> > Would be great to see history server performance improvements and
>> > lower resource utilization at driver!
>> >
>> > Regards,
>> >
>> > Mridul
>> >
>> > On Wed, Nov 16, 2022 at 2:38 AM Kent Yao > > > wrote:
>> >
>> > +1, non-binding
>> >
>> > Gengliang Wang mailto:ltn...@gmail.com>> wrote on
>> > Wednesday, November 16, 2022 at 16:36:
>> > >
>> > > Hi all,
>> > >
>> > > I’d like to start a vote for SPIP: "Better Spark UI
>> scalability and Driver stability for large applications"
>> > >
>> > > The goal of the SPIP is to improve the Driver's stability by
>> supporting storing Spark's UI data in RocksDB. Furthermore, to speed up the
>> read and write operations on RocksDB, it introduces a new Protobuf
>> serializer.
>> > >
>> > > Please also refer to the following:
>> > >
>> > > Previous discussion in the dev mailing list: [DISCUSS] SPIP:
>> Better Spark UI scalability and Driver stability for large applications
>> > > Design Doc: Better Spark UI scalability and Driver stability
>> for large applications
>> > > JIRA: SPARK-41053
>> > >
>> > >
>> > > Please vote on the SPIP for the next 72 hours:
>> > >
>> > > [ ] +1: Accept the proposal as an official SPIP
>> > > [ ] +0
>> > > [ ] -1: I don’t think this is a good idea because …
>> > >
>> > > Kind Regards,
>> > > Gengliang
>> >
>> >
>>  -
>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> > 
>> >
>>
>> --
>> Best regards,
>> Maciej Szymkiewicz
>>
>> Web: https://zero323.net
>> PGP: A30CEF0C31A501EC
>>
>>


Re: [DISCUSS] Flip the default value of Kafka offset fetching config (spark.sql.streaming.kafka.useDeprecatedOffsetFetching)

2022-10-18 Thread Jungtaek Lim
No further voice so far. I'm going to submit a PR. Thanks again for the
feedback!

On Mon, Oct 17, 2022 at 9:30 AM Jungtaek Lim 
wrote:

> Thanks Gabor and Dongjoon for supporting this!
>
> Bump to reach more eyes. If there is no further voice on this in a couple
> of days, I'll consider it as a lazy consensus and submit a PR to this.
>
> On Sat, Oct 15, 2022 at 3:32 AM Dongjoon Hyun 
> wrote:
>
>> +1
>>
>> I agree with Jungtaek and Gabor about switching the default value of
>> configurations with the migration guide.
>>
>> Dongjoon
>>
>> On Thu, Oct 13, 2022 at 12:46 AM Gabor Somogyi 
>> wrote:
>>
>>> Hi Jungtaek,
>>>
>>> Good to hear that the new approach is working fine. +1 from my side.
>>>
>>> BR,
>>> G
>>>
>>>
>>> On Thu, Oct 13, 2022 at 4:12 AM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I would like to propose flipping the default value of Kafka offset
>>>> fetching config. The context is following:
>>>>
>>>> Before Spark 3.1, there was only one approach to fetching offsets, using
>>>> consumer.poll(0). This has been pointed out as a root cause for hangs, since
>>>> there is no timeout for metadata fetch.
>>>>
>>>> In Spark 3.1, we addressed this by introducing a new approach to
>>>> fetching offsets, via SPARK-32032
>>>> <https://issues.apache.org/jira/browse/SPARK-32032>. Since the new
>>>> approach leverages AdminClient, and a consumer group is no longer needed for
>>>> fetching offsets, the required security ACLs are loosened.
>>>>
>>>> Reference:
>>>> https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#offset-fetching
>>>>
>>>> There was some concern about the behavioral change in the security model,
>>>> hence we couldn't make the new approach the default.
>>>>
>>>> Since then, we have observed various Kafka connector related
>>>> issues which came from the old offset fetching (e.g. hangs, issues on
>>>> rebalancing of consumer groups, etc.), and we fixed many of these issues by
>>>> simply flipping the config.
>>>>
>>>> Based on this, I would consider the default value as "incorrect". The
>>>> security-related behavioral change would be introduced inevitably (users can
>>>> set topic-based ACL rules), but most people will benefit. IMHO this is
>>>> something we can deal with in the release/migration notes.
>>>>
>>>> Would like to hear the voices on this.
>>>>
>>>> Thanks,
>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>


Re: [DISCUSS] Flip the default value of Kafka offset fetching config (spark.sql.streaming.kafka.useDeprecatedOffsetFetching)

2022-10-16 Thread Jungtaek Lim
Thanks Gabor and Dongjoon for supporting this!

Bump to reach more eyes. If there is no further voice on this in a couple
of days, I'll consider it as a lazy consensus and submit a PR to this.

On Sat, Oct 15, 2022 at 3:32 AM Dongjoon Hyun 
wrote:

> +1
>
> I agree with Jungtaek and Gabor about switching the default value of
> configurations with the migration guide.
>
> Dongjoon
>
> On Thu, Oct 13, 2022 at 12:46 AM Gabor Somogyi 
> wrote:
>
>> Hi Jungtaek,
>>
>> Good to hear that the new approach is working fine. +1 from my side.
>>
>> BR,
>> G
>>
>>
>> On Thu, Oct 13, 2022 at 4:12 AM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I would like to propose flipping the default value of Kafka offset
>>> fetching config. The context is following:
>>>
>>> Before Spark 3.1, there was only one approach to fetching offsets, using
>>> consumer.poll(0). This has been pointed out as a root cause for hangs, since
>>> there is no timeout for metadata fetch.
>>>
>>> In Spark 3.1, we addressed this by introducing a new approach to
>>> fetching offsets, via SPARK-32032
>>> <https://issues.apache.org/jira/browse/SPARK-32032>. Since the new
>>> approach leverages AdminClient, and a consumer group is no longer needed for
>>> fetching offsets, the required security ACLs are loosened.
>>>
>>> Reference:
>>> https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#offset-fetching
>>>
>>> There was some concern about the behavioral change in the security model,
>>> hence we couldn't make the new approach the default.
>>>
>>> Since then, we have observed various Kafka connector related issues
>>> which came from the old offset fetching (e.g. hangs, issues on rebalancing of
>>> consumer groups, etc.), and we fixed many of these issues by simply flipping
>>> the config.
>>>
>>> Based on this, I would consider the default value as "incorrect". The
>>> security-related behavioral change would be introduced inevitably (users can
>>> set topic-based ACL rules), but most people will benefit. IMHO this is
>>> something we can deal with in the release/migration notes.
>>>
>>> Would like to hear the voices on this.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>


[DISCUSS] Flip the default value of Kafka offset fetching config (spark.sql.streaming.kafka.useDeprecatedOffsetFetching)

2022-10-12 Thread Jungtaek Lim
Hi all,

I would like to propose flipping the default value of Kafka offset fetching
config. The context is following:

Before Spark 3.1, there was only one approach to fetching offsets, using
consumer.poll(0). This has been pointed out as a root cause for hangs, since
there is no timeout for metadata fetch.

In Spark 3.1, we addressed this by introducing a new approach to fetching
offsets, via SPARK-32032 <https://issues.apache.org/jira/browse/SPARK-32032>.
Since the new approach leverages AdminClient, and a consumer group is no
longer needed for fetching offsets, the required security ACLs are loosened.

Reference:
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#offset-fetching

There was some concern about the behavioral change in the security model,
hence we couldn't make the new approach the default.

Since then, we have observed various Kafka connector related issues
which came from the old offset fetching (e.g. hangs, issues on rebalancing of
consumer groups, etc.), and we fixed many of these issues by simply flipping
the config.

Based on this, I would consider the default value as "incorrect". The
security-related behavioral change would be introduced inevitably (users can
set topic-based ACL rules), but most people will benefit. IMHO this is
something we can deal with in the release/migration notes.

Would like to hear the voices on this.

Thanks,
Jungtaek Lim (HeartSaVioR)
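
For readers who want to see what the config flip means in practice, here is a hedged sketch of opting back into the old consumer.poll(0)-based fetching after the proposed default change. It assumes an already active SparkSession named `spark`; the bootstrap servers and topic name are placeholders, and this is a configuration illustration rather than a runnable standalone program.

```python
# Sketch only: assumes an active SparkSession `spark` and a reachable Kafka
# cluster; bootstrap servers and topic are placeholders.
# Opt back in to the deprecated consumer.poll(0)-based offset fetching if the
# AdminClient-based default does not fit your ACL setup.
spark.conf.set("spark.sql.streaming.kafka.useDeprecatedOffsetFetching", "true")

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092,host2:9092")
      .option("subscribe", "topic1")
      .load())
```

With the flag left at its (proposed) default of false, the same read uses the AdminClient path and only topic-based ACLs are needed.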


Re: Welcome Yikun Jiang as a Spark committer

2022-10-08 Thread Jungtaek Lim
Congrats!

On Sat, Oct 8, 2022 at 3:24 PM, huaxin gao wrote:

> Congratulations!
>
> On Fri, Oct 7, 2022 at 11:22 PM Yang,Jie(INF)  wrote:
>
>> Congratulations Yikun!
>>
>> Regards,
>> Yang Jie
>> --
>> *From:* Mridul Muralidharan
>> *Sent:* October 8, 2022, 14:16:02
>> *To:* Yuming Wang
>> *Cc:* Hyukjin Kwon; dev; Yikun Jiang
>> *Subject:* Re: Welcome Yikun Jiang as a Spark committer
>>
>>
>> Congratulations !
>>
>> Regards,
>> Mridul
>>
>> On Sat, Oct 8, 2022 at 12:19 AM Yuming Wang  wrote:
>>
>>> Congratulations Yikun!
>>>
>>> On Sat, Oct 8, 2022 at 12:40 PM Hyukjin Kwon 
>>> wrote:
>>>
 Hi all,

 The Spark PMC recently added Yikun Jiang as a committer on the project.
 Yikun is the major contributor of the infrastructure and GitHub Actions
 in Apache Spark, as well as Kubernetes and PySpark.
 He has put a lot of effort into stabilizing and optimizing the builds
 so we all can work together in Apache Spark more
 efficiently and effectively. He's also driving the SPIP for the Docker
 official image in Apache Spark for users and developers.
 Please join me in welcoming Yikun!




Re: Dropping Apache Spark Hadoop2 Binary Distribution?

2022-10-05 Thread Jungtaek Lim
+1

On Thu, Oct 6, 2022 at 5:59 AM Chao Sun  wrote:

> +1
>
> > and specifically may allow us to finally move off of the ancient version
> of Guava (?)
>
> I think the Guava issue comes from Hive 2.3 dependency, not Hadoop.
>
> On Wed, Oct 5, 2022 at 1:55 PM Xinrong Meng 
> wrote:
>
>> +1.
>>
>> On Wed, Oct 5, 2022 at 1:53 PM Xiao Li 
>> wrote:
>>
>>> +1.
>>>
>>> Xiao
>>>
>>> On Wed, Oct 5, 2022 at 12:49 PM Sean Owen  wrote:
>>>
 I'm OK with this. It simplifies maintenance a bit, and specifically may
 allow us to finally move off of the ancient version of Guava (?)

 On Mon, Oct 3, 2022 at 10:16 PM Dongjoon Hyun 
 wrote:

> Hi, All.
>
> I'm wondering if the following Apache Spark Hadoop2 Binary Distribution
> is still used by someone in the community or not. If it's not used or
> not useful,
> we may remove it from Apache Spark 3.4.0 release.
>
>
> https://downloads.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.tgz
>
> Here is the background of this question.
> Since Apache Spark 2.2.0 (SPARK-19493, SPARK-19550), the Apache
> Spark community has been building and releasing with Java 8 only.
> I believe that the user applications also use Java8+ in these days.
> Recently, I received the following message from the Hadoop PMC.
>
>   > "if you really want to claim hadoop 2.x compatibility, then you
> have to
>   > be building against java 7". Otherwise a lot of people with hadoop
> 2.x
>   > clusters won't be able to run your code. If your projects are
> java8+
>   > only, then they are implicitly hadoop 3.1+, no matter what you use
>   > in your build. Hence: no need for branch-2 branches except
>   > to complicate your build/test/release processes [1]
>
> If Hadoop2 binary distribution is no longer used as of today,
> or incomplete somewhere due to Java 8 building, the following three
> existing alternative Hadoop 3 binary distributions could be
> the better official solution for old Hadoop 2 clusters.
>
> 1) Scala 2.12 and without-hadoop distribution
> 2) Scala 2.12 and Hadoop 3 distribution
> 3) Scala 2.13 and Hadoop 3 distribution
>
> In short, is there anyone who is using Apache Spark 3.3.0 Hadoop2
> Binary distribution?
>
> Dongjoon
>
> [1]
> https://issues.apache.org/jira/browse/ORC-1251?focusedCommentId=17608247=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17608247
>

>>>
>>> --
>>>
>>>


Re: [Structured Streaming + Kafka] Reduced support for alternative offset management

2022-09-01 Thread Jungtaek Lim
Please consider DStream an old-school technology and migrate to Structured
Streaming. There is little development effort on DStream; the main focus is
Spark SQL and, for streaming workloads, Structured Streaming.
For Kafka integration, the guide doc is here,
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html

All questions still apply to Kafka integration on Structured Streaming
though. The main reason we maintain our own checkpoint is to guarantee
fault-tolerance; to provide fault-tolerant semantics, the query should be
able to replay exactly the same data from the latest successful batch. This
is neither feasible nor reliable if we rely on the Kafka commit mechanism.

You can still easily construct a custom streaming query listener to
commit the progress to Kafka separately, so that you can also leverage the
ecosystem of Kafka. This project is an example:
https://github.com/HeartSaVioR/spark-sql-kafka-offset-committer
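
As an illustration of that approach (a hedged sketch under assumptions, not code from the linked project): a listener can read each batch's ending Kafka offsets from the progress event and mirror them into a consumer group of your choosing. It assumes PySpark 3.4+ (the Python StreamingQueryListener API), and the actual Kafka commit call is elided as a placeholder.

```python
# Hedged sketch: the checkpoint remains the source of truth for fault
# tolerance; this listener only mirrors progress so Kafka-side tooling can
# observe it. The commit mechanism itself is left to a Kafka client of your
# choice and is not shown here.
import json
from pyspark.sql.streaming import StreamingQueryListener

class KafkaOffsetCommitter(StreamingQueryListener):
    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        for source in event.progress.sources:
            if "Kafka" not in source.description:
                continue
            # endOffset is a JSON string like {"topic1": {"0": 123, "1": 456}}
            end_offsets = json.loads(source.endOffset)
            # ...commit end_offsets per topic/partition to your consumer
            # group here, using a Kafka client or AdminClient of your choice.

    def onQueryTerminated(self, event):
        pass

# 'spark' is assumed to be an active SparkSession.
spark.streams.addListener(KafkaOffsetCommitter())
```

The listener runs outside the query's commit path, so a failed mirror commit does not affect the query's own fault-tolerance guarantees.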

Hope this helps.

Thanks,
Jungtaek Lim (HeartSaVioR)

On Tue, Aug 30, 2022 at 5:05 PM Martin Andersson 
wrote:

> I was looking around for some documentation regarding how checkpointing
> (or rather, delivery semantics) is done when consuming from kafka with
> structured streaming and I stumbled across this old documentation (that
> still somehow exists in latest versions) at
> https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#checkpoints.
>
>
> This page (which I assume is from around the time of Spark 2.4?) describes
> that storing offsets using checkpointing is the *least* reliable method
> and goes further into how to use kafka or an external storage to commit
> offsets.
>
> It also says
>
> If you enable Spark checkpointing, offsets will be stored in the
> checkpoint. (...) Furthermore, you cannot recover from a checkpoint if your
> application code has changed.
>
>
> This all leaves me with several questions:
>
>1. Is the above quote still true for Spark 3, that the checkpoint will
>break if you change the code? How about changing the subscribe pattern?
>
>2. Why was the option to manually commit offsets asynchronously to
>kafka removed when it was deemed more reliable than checkpointing? Not to
>mention that storing offsets in kafka allows you to use all the tools
>offered in the kafka distribution to easily reset/rewind offsets on
>specific topics, which doesn't seem to be possible when using checkpoints.
>
>3. From a user perspective, storing offsets in kafka offers more
>features. From a developer perspective, having to re-implement offset
>storage with checkpointing across several output systems (such as HDFS, AWS
>S3 and other object storages) seems like a lot of unnecessary work and
>re-inventing the wheel.
>Is the discussion leading up to the decision to only support storing
>offsets with checkpointing documented anywhere, perhaps in a jira?
>
> Thanks for your time
>


Re: Welcoming three new PMC members

2022-08-09 Thread Jungtaek Lim
Congrats everyone!

On Wed, Aug 10, 2022 at 8:57 AM Hyukjin Kwon  wrote:

> Congrats everybody!
>
> On Wed, 10 Aug 2022 at 05:50, Mridul Muralidharan 
> wrote:
>
>>
>> Congratulations !
>> Great to have you join the PMC !!
>>
>> Regards,
>> Mridul
>>
>> On Tue, Aug 9, 2022 at 11:57 AM vaquar khan 
>> wrote:
>>
>>> Congratulations
>>>
>>> On Tue, Aug 9, 2022, 11:40 AM Xiao Li  wrote:
>>>
 Hi all,

 The Spark PMC recently voted to add three new PMC members. Join me in
 welcoming them to their new roles!

 New PMC members: Huaxin Gao, Gengliang Wang and Maxim Gekk

 The Spark PMC

>>>

