Re: Apache Spark 3.2 Expectation

2021-07-01 Thread Gengliang Wang
Hi all,

I just cut branch-3.2 on GitHub and created version 3.3.0 in JIRA.
When merging PRs to the master branch before the 3.2.0 RC, please help
cherry-pick bug fixes and the ongoing major features mentioned in this
thread to branch-3.2. Thanks!


Re: Apache Spark 3.2 Expectation

2021-07-01 Thread Dongjoon Hyun
Thank you, Gengliang!


Re: Apache Spark 3.2 Expectation

2021-06-30 Thread Gengliang Wang
Hi all,

Just as a gentle reminder, I will do the branch cut tomorrow. Please focus
on finalizing the work to land in Spark 3.2.0.
After the branch cut, we can still merge the ongoing major features
mentioned in this thread, but there should be no other new features in
branch-3.2.
Thanks!


Re: Apache Spark 3.2 Expectation

2021-06-17 Thread Hyukjin Kwon
*GA -> QA


Re: Apache Spark 3.2 Expectation

2021-06-17 Thread Hyukjin Kwon
I think we should make sure to treat these items in the list as exceptions
from the code freeze, and discourage pushing new APIs and features, though.

During the GA period, ideally we should focus on bug fixes and polishing.

It would be great if we could speed up on these items in the list too.



Re: Apache Spark 3.2 Expectation

2021-06-17 Thread Gengliang Wang
Thanks for the suggestions from Dongjoon, Liang-Chi, Min, and Xiao!
Now we have made it clear that this is a soft cut and we can still merge
important code changes to branch-3.2 before the RC. Let's keep the branch
cut date as July 1st.


Re: Apache Spark 3.2 Expectation

2021-06-16 Thread Dongjoon Hyun
> First, I think you are saying "branch-3.2";

To Xiao: yes, it was a typo of "branch-3.2".

> We do strongly prefer to cut the release for Spark 3.2.0 including all
the patches under SPARK-30602.
> This way, we can backport the other performance/operability enhancements
tickets under SPARK-33235 into branch-3.2 to be released in future Spark
3.2.x patch releases.

To Min, after releasing 3.2.0, only bug fixes are allowed for 3.2.1+ as
Xiao wrote.




Re: Apache Spark 3.2 Expectation

2021-06-16 Thread Xiao Li
>
> To Liang-Chi, I'm -1 for postponing the branch cut because this is a soft
> cut and the committers still are able to commit to `branch-3.3` according
> to their decisions.


First, I think you are saying "branch-3.2";

Second, the "so cut" means no "code freeze", although we cut the branch. To
avoid releasing half-baked and unready features, the release
manager needs to be very careful when cutting the RC. Based on what is
proposed here, the RC date is the actual code freeze date.

This way, we can backport the other performance/operability enhancements
> tickets under SPARK-33235 into branch-3.2 to be released in future Spark
> 3.2.x patch releases.


This is not allowed based on the policy. Only bug fixes can be merged to
the patch releases. Thus, if we know a feature will introduce a major
performance regression, we have to turn it off by default.
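
To make the policy concrete, here is a minimal Scala sketch of shipping a
risky change dark behind a flag; the config key below is hypothetical, not
a real Spark setting, and only spark.conf.getOption is real API:

    import org.apache.spark.sql.SparkSession

    object FeatureFlagSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("feature-flag-sketch")
          .master("local[*]")
          .getOrCreate()

        // Hypothetical flag: if absent from the user's conf, the feature
        // stays off, so a 3.2.x patch release cannot silently change behavior.
        val enabled = spark.conf
          .getOption("spark.example.riskyOptimization.enabled") // not a real Spark config
          .exists(_.toBoolean)

        if (enabled) {
          // new, potentially regressing code path (opt-in only)
        } else {
          // existing, well-tested behavior (the default)
        }

        spark.stop()
      }
    }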

Xiao




Re: Apache Spark 3.2 Expectation

2021-06-16 Thread Min Shen
Hi Gengliang,

Thanks for volunteering as the release manager for Spark 3.2.0.
Regarding the ongoing work of push-based shuffle in SPARK-30602, we are
close to having all the patches merged to master to enable push-based
shuffle.
Currently, there are 2 PRs under SPARK-30602 that are under active review
(SPARK-32922 and SPARK-35671), and hopefully can be merged soon.
We should be able to post the PRs for the other 2 remaining tickets
(SPARK-32923 and SPARK-35546) early next week.

The tickets under SPARK-30602 are the minimum set of patches to enable
push-based shuffle.
We do have other performance/operability enhancement tickets under
SPARK-33235 that are needed to fully contribute what we have internally for
push-based shuffle.
However, these are optional for enabling push-based shuffle.
We do strongly prefer to cut the release for Spark 3.2.0 including all the
patches under SPARK-30602.
This way, we can backport the other performance/operability enhancements
tickets under SPARK-33235 into branch-3.2 to be released in future Spark
3.2.x patch releases.
I understand the preference for not postponing the branch cut date.
We will check with Dongjoon regarding the soft cut date and the flexibility
for including the remaining tickets under SPARK-30602 into branch-3.2.
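
For readers following along, a rough sketch of what enabling the feature
looks like once the SPARK-30602 patches land, assuming the Spark 3.2 config
names for push-based shuffle; it only takes effect on YARN, and the external
shuffle service plus server-side merge support must be set up separately:

    import org.apache.spark.sql.SparkSession

    // Client-side sketch: opt in to push-based shuffle (off by default).
    // The YARN node managers must also run the external shuffle service
    // with push-merge support; that server-side setup is omitted here.
    val spark = SparkSession.builder()
      .appName("push-based-shuffle-sketch")
      .config("spark.shuffle.push.enabled", "true")
      .config("spark.shuffle.service.enabled", "true")
      .getOrCreate()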

Best,
Min


Re: Apache Spark 3.2 Expectation

2021-06-16 Thread Liang-Chi Hsieh


Thanks Dongjoon. I've talked with Dongjoon offline to learn more about this.
As it is a soft cut date, there is no reason to postpone it.

It sounds good then to keep the original branch cut date.

Thank you.






Re: Apache Spark 3.2 Expectation

2021-06-16 Thread Dongjoon Hyun
Thank you for volunteering, Gengliang.

Apache Spark 3.2.0 is the first version enabling AQE by default. I'm also
watching some on-going improvements on that.

https://issues.apache.org/jira/browse/SPARK-33828 (SQL Adaptive Query
Execution QA)
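
For context, a small sketch of what the new default means in practice;
spark.sql.adaptive.enabled is the flag that flipped to true in 3.2.0, and
the local session here is only for illustration:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("aqe-default-check")
      .master("local[*]")
      .getOrCreate()

    // On 3.2.0 this prints "true" without any user configuration.
    println(spark.conf.get("spark.sql.adaptive.enabled"))

    // Sessions that hit an AQE regression during QA can opt out explicitly.
    spark.conf.set("spark.sql.adaptive.enabled", "false")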

To Liang-Chi, I'm -1 for postponing the branch cut because this is a soft
cut and the committers still are able to commit to `branch-3.3` according
to their decisions.

Given that Apache Spark had 115 commits in a week in various areas
concurrently, we should start QA for Apache Spark 3.2 by creating
branch-3.3 and allowing only limited backporting.

https://github.com/apache/spark/graphs/commit-activity

Bests,
Dongjoon.




Re: Apache Spark 3.2 Expectation

2021-06-16 Thread Liang-Chi Hsieh
First, thanks for volunteering as the release manager of Spark 3.2.0,
Gengliang!

And yes, we're working on the two important Structured Streaming features,
RocksDB StateStore and session window, and expect to have them in the new
release.
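
For a taste of the two features, a hedged sketch against the APIs as they
are landing for 3.2; the rate source and the grouping key are purely
illustrative, and starting the query with a sink is omitted:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, count, session_window}

    val spark = SparkSession.builder()
      .appName("streaming-3.2-sketch")
      .master("local[*]")
      // Swap the default HDFS-backed state store for the RocksDB provider.
      .config("spark.sql.streaming.stateStore.providerClass",
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
      .getOrCreate()

    // Toy source: the built-in rate source emits `timestamp` and `value`.
    val events = spark.readStream.format("rate").load()

    // Session window: events for the same key closer than 5 minutes apart
    // collapse into one session.
    val sessions = events
      .groupBy(session_window(col("timestamp"), "5 minutes"), col("value") % 10)
      .agg(count("*").as("events"))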

So I propose to postpone the branch cut date.

Thank you!

Liang-Chi





Re: Apache Spark 3.2 Expectation

2021-06-16 Thread Gengliang Wang
Thanks, Hyukjin.

The expected target branch cut date of Spark 3.2 is *July 1st* on
https://spark.apache.org/versioning-policy.html. However, I notice that
there are still multiple important projects in progress now:

[Core]

   - SPIP: Support push-based shuffle to improve shuffle efficiency
     https://issues.apache.org/jira/browse/SPARK-30602

[SQL]

   - Support ANSI SQL INTERVAL types (see the sketch at the end of this
     message)
     https://issues.apache.org/jira/browse/SPARK-27790
   - Support Timestamp without time zone data type
     https://issues.apache.org/jira/browse/SPARK-35662
   - Aggregate (Min/Max/Count) push down for Parquet
     https://issues.apache.org/jira/browse/SPARK-34952

[Streaming]

   - EventTime based sessionization (session window)
     https://issues.apache.org/jira/browse/SPARK-10816
   - Add RocksDB StateStore as external module
     https://issues.apache.org/jira/browse/SPARK-34198


I wonder whether we should postpone the branch cut date.
cc Min Shen, Yi Wu, Max Gekk, Huaxin Gao, Jungtaek Lim, Yuanjian
Li, Liang-Chi Hsieh, who work on the projects above.
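
As a concrete taste of the ANSI interval item in the [SQL] list above (the
sketch promised there), assuming an existing Spark 3.2 session named
`spark`: interval literals now resolve to the dedicated year-month and
day-time interval types instead of the old CalendarIntervalType.

    // Assumes a Spark 3.2 session named `spark`.
    val df = spark.sql(
      """SELECT
        |  INTERVAL '1-2' YEAR TO MONTH        AS ym,
        |  INTERVAL '1 10:20:30' DAY TO SECOND AS dt
        |""".stripMargin)

    df.printSchema()
    // root
    //  |-- ym: interval year to month (nullable = false)
    //  |-- dt: interval day to second (nullable = false)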


Re: Apache Spark 3.2 Expectation

2021-06-15 Thread Hyukjin Kwon
+1, thanks.

On Tue, 15 Jun 2021, 16:17 Gengliang Wang,  wrote:

> Hi,
>
> As the expected release date is close,  I would like to volunteer as the
> release manager for Apache Spark 3.2.0.
>
> Thanks,
> Gengliang
>
> On Mon, Apr 12, 2021 at 1:59 PM Wenchen Fan  wrote:
>
>> An update: we made a mistake and picked the Spark 3.2 release date
>> based on the scheduled release date of 3.1. However, 3.1 was delayed and
>> released on March 2. In order to have a full 6 months of development for
>> 3.2, the target release date for 3.2 should be September 2.
>>
>> I'm updating the release dates in
>> https://github.com/apache/spark-website/pull/331
>>
>> Thanks,
>> Wenchen
>>
>> On Thu, Mar 11, 2021 at 11:17 PM Dongjoon Hyun 
>> wrote:
>>
>>> Thank you, Xiao, Wenchen and Hyukjin.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Thu, Mar 11, 2021 at 2:15 AM Hyukjin Kwon 
>>> wrote:
>>>
 Just for an update, I will send a discussion email about my idea late
 this week or early next week.

 On Thu, Mar 11, 2021 at 7:00 PM, Wenchen Fan wrote:

> There are many projects going on right now, such as new DS v2 APIs,
> ANSI interval types, join improvement, disaggregated shuffle, etc. I don't
> think it's realistic to do the branch cut in April.
>
> I'm +1 to release 3.2 around July, but it doesn't mean we have to cut
> the branch 3 months earlier. We should make the release process faster and
> cut the branch around June probably.
>
>
>
> On Thu, Mar 11, 2021 at 4:41 AM Xiao Li  wrote:
>
>> Below are some nice-to-have features we can work on in Spark 3.2: Lateral
>> Join support, interval data type, timestamp without time zone, un-nesting
>> arbitrary queries, the returned metrics of DSV2, and error message
>> standardization. Spark 3.2 will be another exciting release, I believe!
>>
>> Go Spark!
>>
>> Xiao
>>
>>
>>
>>
>> Dongjoon Hyun wrote on Wed, Mar 10, 2021 at 12:25 PM:
>>
>>> Hi, Xiao.
>>>
>>> This thread started 13 days ago. Since you asked the community about
>>> major features or timelines at that time, could you share your roadmap 
>>> or
>>> expectations if you have something in mind?
>>>
>>> > Thank you, Dongjoon, for initiating this discussion. Let us keep
>>> it open. It might take 1-2 weeks to collect from the community all the
>>> features we plan to build and ship in 3.2 since we just finished the 3.1
>>> voting.
>>> > TBH, cutting the branch this April does not look good to me. That
>>> means we only have one month left for feature development of Spark 
>>> 3.2. Do
>>> we have enough features in the current master branch? If not, are we 
>>> able
>>> to finish major features we collected here? Do they have a timeline or
>>> project plan?
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>>
>>> On Wed, Mar 3, 2021 at 2:58 PM Dongjoon Hyun <
>>> dongjoon.h...@gmail.com> wrote:
>>>
 Hi, John.

 This thread aims to share your expectations and goals (and maybe
 work progress) for Apache Spark 3.2 because we are making this 
 together. :)

 Bests,
 Dongjoon.


 On Wed, Mar 3, 2021 at 1:59 PM John Zhuge 
 wrote:

> Hi Dongjoon,
>
> Is it possible to get ViewCatalog in? The community already had
> fairly detailed discussions.
>
> Thanks,
> John
>
> On Thu, Feb 25, 2021 at 8:57 AM Dongjoon Hyun <
> dongjoon.h...@gmail.com> wrote:
>
>> Hi, All.
>>
>> Since we have been preparing Apache Spark 3.2.0 in master branch
>> since December 2020, March seems to be a good time to share our 
>> thoughts
>> and aspirations on Apache Spark 3.2.
>>
>> According to the progress on Apache Spark 3.1 release, Apache
>> Spark 3.2 seems to be the last minor release of this year. Given the
>> timeframe, we might consider the following. (This is a small set. 
>> Please
>> add your thoughts to this limited list.)
>>
>> # Languages
>>
>> - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075
>> but slipped out. Currently, we are trying to use Scala 2.13.5 via
>> SPARK-34505 and investigating the publishing issue. Thank you for 
>> your
>> contributions and feedback on this.
>>
>> - Java 17 LTS Support: Java 17 LTS will arrive in September 2021.
>> Like Java 11, we need lots of support from our dependencies. Let's 
>> see.
>>
>> - Python 3.6 Deprecation(?): Python 3.6 community support ends at
>> 2021-12-23. So, the deprecation is not required yet, but we had 
>> 

Re: Apache Spark 3.2 Expectation

2021-06-15 Thread Gengliang Wang
Hi,

As the expected release date is close, I would like to volunteer as the
release manager for Apache Spark 3.2.0.

Thanks,
Gengliang

On Mon, Apr 12, 2021 at 1:59 PM Wenchen Fan  wrote:

> An update: we made a mistake in picking the Spark 3.2 release date
> based on the scheduled release date of 3.1. However, 3.1 was delayed and
> released on March 2. In order to have a full 6 months of development for 3.2,
> the target release date for 3.2 should be September 2.
>
> I'm updating the release dates in
> https://github.com/apache/spark-website/pull/331
>
> Thanks,
> Wenchen
>
> On Thu, Mar 11, 2021 at 11:17 PM Dongjoon Hyun 
> wrote:
>
>> Thank you, Xiao, Wenchen and Hyukjin.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Thu, Mar 11, 2021 at 2:15 AM Hyukjin Kwon  wrote:
>>
>>> Just for an update, I will send a discussion email about my idea late
>>> this week or early next week.
>>>
>>> On Thu, Mar 11, 2021 at 7:00 PM, Wenchen Fan wrote:
>>>
 There are many projects going on right now, such as new DS v2 APIs,
 ANSI interval types, join improvement, disaggregated shuffle, etc. I don't
 think it's realistic to do the branch cut in April.

 I'm +1 to release 3.2 around July, but it doesn't mean we have to cut
 the branch 3 months earlier. We should make the release process faster and
 cut the branch around June, probably.



 On Thu, Mar 11, 2021 at 4:41 AM Xiao Li  wrote:

> Below are some nice-to-have features we can work on in Spark 3.2:
> Lateral Join support, interval data type, timestamp without time zone,
> un-nesting arbitrary queries, the returned metrics of DSV2, and error
> message standardization. I believe Spark 3.2 will be another exciting
> release!
>
> Go Spark!
>
> Xiao
>
>
>
>
> Dongjoon Hyun wrote on Wed, Mar 10, 2021 at 12:25 PM:
>
>> Hi, Xiao.
>>
>> This thread started 13 days ago. Since you asked the community about
>> major features or timelines at that time, could you share your roadmap or
>> expectations if you have something in mind?
>>
>> > Thank you, Dongjoon, for initiating this discussion. Let us keep it
>> open. It might take 1-2 weeks to collect from the community all the
>> features we plan to build and ship in 3.2 since we just finished the 3.1
>> voting.
>> > TBH, cutting the branch this April does not look good to me. That
>> means we only have one month left for feature development of Spark 3.2. 
>> Do
>> we have enough features in the current master branch? If not, are we able
>> to finish major features we collected here? Do they have a timeline or
>> project plan?
>>
>> Bests,
>> Dongjoon.
>>
>>
>>
>> On Wed, Mar 3, 2021 at 2:58 PM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, John.
>>>
>>> This thread aims to share your expectations and goals (and maybe
>>> work progress) for Apache Spark 3.2 because we are making this together. 
>>> :)
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Wed, Mar 3, 2021 at 1:59 PM John Zhuge  wrote:
>>>
 Hi Dongjoon,

 Is it possible to get ViewCatalog in? The community already had
 fairly detailed discussions.

 Thanks,
 John

 On Thu, Feb 25, 2021 at 8:57 AM Dongjoon Hyun <
 dongjoon.h...@gmail.com> wrote:

> Hi, All.
>
> Since we have been preparing Apache Spark 3.2.0 in master branch
> since December 2020, March seems to be a good time to share our 
> thoughts
> and aspirations on Apache Spark 3.2.
>
> According to the progress on Apache Spark 3.1 release, Apache
> Spark 3.2 seems to be the last minor release of this year. Given the
> timeframe, we might consider the following. (This is a small set. 
> Please
> add your thoughts to this limited list.)
>
> # Languages
>
> - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but
> slipped out. Currently, we are trying to use Scala 2.13.5 via 
> SPARK-34505
> and investigating the publishing issue. Thank you for your 
> contributions
> and feedback on this.
>
> - Java 17 LTS Support: Java 17 LTS will arrive in September 2021.
> Like Java 11, we need lots of support from our dependencies. Let's 
> see.
>
> - Python 3.6 Deprecation(?): Python 3.6 community support ends at
> 2021-12-23. So, the deprecation is not required yet, but we had better
> prepare it because we don't have an ETA of Apache Spark 3.3 in 2022.
>
> - SparkR CRAN publishing: As we know, it's discontinued so far.
> Resuming it depends on the success of Apache SparkR 3.1.1 CRAN 
> publishing.

Re: Apache Spark 3.2 Expectation

2021-04-11 Thread Wenchen Fan
An update: we made a mistake in picking the Spark 3.2 release date
based on the scheduled release date of 3.1. However, 3.1 was delayed and
released on March 2. In order to have a full 6 months of development for 3.2,
the target release date for 3.2 should be September 2.

I'm updating the release dates in
https://github.com/apache/spark-website/pull/331

Thanks,
Wenchen

On Thu, Mar 11, 2021 at 11:17 PM Dongjoon Hyun 
wrote:

> Thank you, Xiao, Wenchen and Hyukjin.
>
> Bests,
> Dongjoon.
>
>
> On Thu, Mar 11, 2021 at 2:15 AM Hyukjin Kwon  wrote:
>
>> Just for an update, I will send a discussion email about my idea late
>> this week or early next week.
>>
>> On Thu, Mar 11, 2021 at 7:00 PM, Wenchen Fan wrote:
>>
>>> There are many projects going on right now, such as new DS v2 APIs, ANSI
>>> interval types, join improvement, disaggregated shuffle, etc. I don't
>>> think it's realistic to do the branch cut in April.
>>>
>>> I'm +1 to release 3.2 around July, but it doesn't mean we have to cut
>>> the branch 3 months earlier. We should make the release process faster and
>>> cut the branch around June, probably.
>>>
>>>
>>>
>>> On Thu, Mar 11, 2021 at 4:41 AM Xiao Li  wrote:
>>>
 Below are some nice-to-have features we can work on in Spark 3.2:
 Lateral Join support, interval data type, timestamp without time zone,
 un-nesting arbitrary queries, the returned metrics of DSV2, and error
 message standardization. I believe Spark 3.2 will be another exciting
 release!

 Go Spark!

 Xiao




 Dongjoon Hyun wrote on Wed, Mar 10, 2021 at 12:25 PM:

> Hi, Xiao.
>
> This thread started 13 days ago. Since you asked the community about
> major features or timelines at that time, could you share your roadmap or
> expectations if you have something in mind?
>
> > Thank you, Dongjoon, for initiating this discussion. Let us keep it
> open. It might take 1-2 weeks to collect from the community all the
> features we plan to build and ship in 3.2 since we just finished the 3.1
> voting.
> > TBH, cutting the branch this April does not look good to me. That
> means we only have one month left for feature development of Spark 3.2. 
> Do
> we have enough features in the current master branch? If not, are we able
> to finish major features we collected here? Do they have a timeline or
> project plan?
>
> Bests,
> Dongjoon.
>
>
>
> On Wed, Mar 3, 2021 at 2:58 PM Dongjoon Hyun 
> wrote:
>
>> Hi, John.
>>
>> This thread aims to share your expectations and goals (and maybe work
>> progress) for Apache Spark 3.2 because we are making this together. :)
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Wed, Mar 3, 2021 at 1:59 PM John Zhuge  wrote:
>>
>>> Hi Dongjoon,
>>>
>>> Is it possible to get ViewCatalog in? The community already had
>>> fairly detailed discussions.
>>>
>>> Thanks,
>>> John
>>>
>>> On Thu, Feb 25, 2021 at 8:57 AM Dongjoon Hyun <
>>> dongjoon.h...@gmail.com> wrote:
>>>
 Hi, All.

 Since we have been preparing Apache Spark 3.2.0 in master branch
 since December 2020, March seems to be a good time to share our 
 thoughts
 and aspirations on Apache Spark 3.2.

 According to the progress on Apache Spark 3.1 release, Apache Spark
 3.2 seems to be the last minor release of this year. Given the 
 timeframe,
 we might consider the following. (This is a small set. Please add your
 thoughts to this limited list.)

 # Languages

 - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but
 slipped out. Currently, we are trying to use Scala 2.13.5 via 
 SPARK-34505
 and investigating the publishing issue. Thank you for your 
 contributions
 and feedback on this.

 - Java 17 LTS Support: Java 17 LTS will arrive in September 2021.
 Like Java 11, we need lots of support from our dependencies. Let's see.

 - Python 3.6 Deprecation(?): Python 3.6 community support ends at
 2021-12-23. So, the deprecation is not required yet, but we had better
 prepare it because we don't have an ETA of Apache Spark 3.3 in 2022.

 - SparkR CRAN publishing: As we know, it's discontinued so far.
 Resuming it depends on the success of Apache SparkR 3.1.1 CRAN 
 publishing.
 If we succeed in reviving it, we can keep publishing. Otherwise, I 
 believe
 we had better drop it from the releasing work item list officially.

 # Dependencies

 - Apache Hadoop 3.3.2: Hadoop 3.2.0 becomes the default Hadoop
 profile in Apache Spark 3.1. Currently, Spark master branch lives on 

Re: Apache Spark 3.2 Expectation

2021-03-11 Thread Dongjoon Hyun
Thank you, Xiao, Wenchen and Hyukjin.

Bests,
Dongjoon.


On Thu, Mar 11, 2021 at 2:15 AM Hyukjin Kwon  wrote:

> Just for an update, I will send a discussion email about my idea late this
> week or early next week.
>
> 2021년 3월 11일 (목) 오후 7:00, Wenchen Fan 님이 작성:
>
>> There are many projects going on right now, such as new DS v2 APIs, ANSI
>> interval types, join improvement, disaggregated shuffle, etc. I don't
>> think it's realistic to do the branch cut in April.
>>
>> I'm +1 to release 3.2 around July, but it doesn't mean we have to cut the
>> branch 3 months earlier. We should make the release process faster and cut
>> the branch around June, probably.
>>
>>
>>
>> On Thu, Mar 11, 2021 at 4:41 AM Xiao Li  wrote:
>>
>>> Below are some nice-to-have features we can work on in Spark 3.2:
>>> Lateral Join support, interval data type, timestamp without time zone,
>>> un-nesting arbitrary queries, the returned metrics of DSV2, and error
>>> message standardization. I believe Spark 3.2 will be another exciting
>>> release!
>>>
>>> Go Spark!
>>>
>>> Xiao
>>>
>>>
>>>
>>>
>>> Dongjoon Hyun wrote on Wed, Mar 10, 2021 at 12:25 PM:
>>>
 Hi, Xiao.

 This thread started 13 days ago. Since you asked the community about
 major features or timelines at that time, could you share your roadmap or
 expectations if you have something in mind?

 > Thank you, Dongjoon, for initiating this discussion. Let us keep it
 open. It might take 1-2 weeks to collect from the community all the
 features we plan to build and ship in 3.2 since we just finished the 3.1
 voting.
 > TBH, cutting the branch this April does not look good to me. That
 means we only have one month left for feature development of Spark 3.2. Do
 we have enough features in the current master branch? If not, are we able
 to finish major features we collected here? Do they have a timeline or
 project plan?

 Bests,
 Dongjoon.



 On Wed, Mar 3, 2021 at 2:58 PM Dongjoon Hyun 
 wrote:

> Hi, John.
>
> This thread aims to share your expectations and goals (and maybe work
> progress) for Apache Spark 3.2 because we are making this together. :)
>
> Bests,
> Dongjoon.
>
>
> On Wed, Mar 3, 2021 at 1:59 PM John Zhuge  wrote:
>
>> Hi Dongjoon,
>>
>> Is it possible to get ViewCatalog in? The community already had
>> fairly detailed discussions.
>>
>> Thanks,
>> John
>>
>> On Thu, Feb 25, 2021 at 8:57 AM Dongjoon Hyun <
>> dongjoon.h...@gmail.com> wrote:
>>
>>> Hi, All.
>>>
>>> Since we have been preparing Apache Spark 3.2.0 in master branch
>>> since December 2020, March seems to be a good time to share our thoughts
>>> and aspirations on Apache Spark 3.2.
>>>
>>> According to the progress on Apache Spark 3.1 release, Apache Spark
>>> 3.2 seems to be the last minor release of this year. Given the 
>>> timeframe,
>>> we might consider the following. (This is a small set. Please add your
>>> thoughts to this limited list.)
>>>
>>> # Languages
>>>
>>> - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but
>>> slipped out. Currently, we are trying to use Scala 2.13.5 via 
>>> SPARK-34505
>>> and investigating the publishing issue. Thank you for your contributions
>>> and feedback on this.
>>>
>>> - Java 17 LTS Support: Java 17 LTS will arrive in September 2021.
>>> Like Java 11, we need lots of support from our dependencies. Let's see.
>>>
>>> - Python 3.6 Deprecation(?): Python 3.6 community support ends at
>>> 2021-12-23. So, the deprecation is not required yet, but we had better
>>> prepare it because we don't have an ETA of Apache Spark 3.3 in 2022.
>>>
>>> - SparkR CRAN publishing: As we know, it's discontinued so far.
>>> Resuming it depends on the success of Apache SparkR 3.1.1 CRAN 
>>> publishing.
>>> If we succeed in reviving it, we can keep publishing. Otherwise, I 
>>> believe
>>> we had better drop it from the releasing work item list officially.
>>>
>>> # Dependencies
>>>
>>> - Apache Hadoop 3.3.2: Hadoop 3.2.0 becomes the default Hadoop
>>> profile in Apache Spark 3.1. Currently, Spark master branch lives on 
>>> Hadoop
>>> 3.2.2's shaded clients via SPARK-33212. So far, there is one on-going
>>> report at YARN environment. We hope it will be fixed soon at Spark 3.2
>>> timeframe and we can move toward Hadoop 3.3.2.
>>>
>>> - Apache Hive 2.3.9: Spark 3.0 starts to use Hive 2.3.7 by default
>>> instead of old Hive 1.2 fork. Spark 3.1 removed hive-1.2 profile 
>>> completely
>>> via SPARK-32981 and replaced the generated hive-service-rpc code with 
>>> the
>>> official dependency via SPARK-32981. We are steadily improving this area
>>> and 

Re: Apache Spark 3.2 Expectation

2021-03-11 Thread Hyukjin Kwon
Just for an update, I will send a discussion email about my idea late this
week or early next week.

On Thu, Mar 11, 2021 at 7:00 PM, Wenchen Fan wrote:

> There are many projects going on right now, such as new DS v2 APIs, ANSI
> interval types, join improvement, disaggregated shuffle, etc. I don't
> think it's realistic to do the branch cut in April.
>
> I'm +1 to release 3.2 around July, but it doesn't mean we have to cut the
> branch 3 months earlier. We should make the release process faster and cut
> the branch around June, probably.
>
>
>
> On Thu, Mar 11, 2021 at 4:41 AM Xiao Li  wrote:
>
>> Below are some nice-to-have features we can work on in Spark 3.2:
>> Lateral Join support, interval data type, timestamp without time zone,
>> un-nesting arbitrary queries, the returned metrics of DSV2, and error
>> message standardization. I believe Spark 3.2 will be another exciting
>> release!
>>
>> Go Spark!
>>
>> Xiao
>>
>>
>>
>>
>> Dongjoon Hyun wrote on Wed, Mar 10, 2021 at 12:25 PM:
>>
>>> Hi, Xiao.
>>>
>>> This thread started 13 days ago. Since you asked the community about
>>> major features or timelines at that time, could you share your roadmap or
>>> expectations if you have something in mind?
>>>
>>> > Thank you, Dongjoon, for initiating this discussion. Let us keep it
>>> open. It might take 1-2 weeks to collect from the community all the
>>> features we plan to build and ship in 3.2 since we just finished the 3.1
>>> voting.
>>> > TBH, cutting the branch this April does not look good to me. That
>>> means we only have one month left for feature development of Spark 3.2. Do
>>> we have enough features in the current master branch? If not, are we able
>>> to finish major features we collected here? Do they have a timeline or
>>> project plan?
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>>
>>> On Wed, Mar 3, 2021 at 2:58 PM Dongjoon Hyun 
>>> wrote:
>>>
 Hi, John.

 This thread aims to share your expectations and goals (and maybe work
 progress) for Apache Spark 3.2 because we are making this together. :)

 Bests,
 Dongjoon.


 On Wed, Mar 3, 2021 at 1:59 PM John Zhuge  wrote:

> Hi Dongjoon,
>
> Is it possible to get ViewCatalog in? The community already had fairly
> detailed discussions.
>
> Thanks,
> John
>
> On Thu, Feb 25, 2021 at 8:57 AM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> Since we have been preparing Apache Spark 3.2.0 in master branch
>> since December 2020, March seems to be a good time to share our thoughts
>> and aspirations on Apache Spark 3.2.
>>
>> According to the progress on Apache Spark 3.1 release, Apache Spark
>> 3.2 seems to be the last minor release of this year. Given the timeframe,
>> we might consider the following. (This is a small set. Please add your
>> thoughts to this limited list.)
>>
>> # Languages
>>
>> - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but
>> slipped out. Currently, we are trying to use Scala 2.13.5 via SPARK-34505
>> and investigating the publishing issue. Thank you for your contributions
>> and feedback on this.
>>
>> - Java 17 LTS Support: Java 17 LTS will arrive in September 2021.
>> Like Java 11, we need lots of support from our dependencies. Let's see.
>>
>> - Python 3.6 Deprecation(?): Python 3.6 community support ends at
>> 2021-12-23. So, the deprecation is not required yet, but we had better
>> prepare it because we don't have an ETA of Apache Spark 3.3 in 2022.
>>
>> - SparkR CRAN publishing: As we know, it's discontinued so far.
>> Resuming it depends on the success of Apache SparkR 3.1.1 CRAN 
>> publishing.
>> If we succeed in reviving it, we can keep publishing. Otherwise, I believe
>> we had better drop it from the releasing work item list officially.
>>
>> # Dependencies
>>
>> - Apache Hadoop 3.3.2: Hadoop 3.2.0 becomes the default Hadoop
>> profile in Apache Spark 3.1. Currently, Spark master branch lives on 
>> Hadoop
>> 3.2.2's shaded clients via SPARK-33212. So far, there is one on-going
>> report at YARN environment. We hope it will be fixed soon at Spark 3.2
>> timeframe and we can move toward Hadoop 3.3.2.
>>
>> - Apache Hive 2.3.9: Spark 3.0 starts to use Hive 2.3.7 by default
>> instead of old Hive 1.2 fork. Spark 3.1 removed hive-1.2 profile 
>> completely
>> via SPARK-32981 and replaced the generated hive-service-rpc code with the
>> official dependency via SPARK-32981. We are steadily improving this area
>> and will consume Hive 2.3.9 if available.
>>
>> - K8s Client 4.13.2: During K8s GA activity, Spark 3.1 upgrades K8s
>> client dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in order to
>> support K8s model 1.19.
>>
>> - Kafka Client 2.8: To bring the client fixes, Spark 

Re: Apache Spark 3.2 Expectation

2021-03-11 Thread Wenchen Fan
There are many projects going on right now, such as new DS v2 APIs, ANSI
interval types, join improvement, disaggregated shuffle, etc. I don't
think it's realistic to do the branch cut in April.

I'm +1 to release 3.2 around July, but it doesn't mean we have to cut the
branch 3 months earlier. We should make the release process faster and cut
the branch around June, probably.



On Thu, Mar 11, 2021 at 4:41 AM Xiao Li  wrote:

> Below are some nice-to-have features we can work on in Spark 3.2:
> Lateral Join support, interval data type, timestamp without time zone,
> un-nesting arbitrary queries, the returned metrics of DSV2, and error
> message standardization. I believe Spark 3.2 will be another exciting
> release!
>
> Go Spark!
>
> Xiao
>
>
>
>
> Dongjoon Hyun wrote on Wed, Mar 10, 2021 at 12:25 PM:
>
>> Hi, Xiao.
>>
>> This thread started 13 days ago. Since you asked the community about
>> major features or timelines at that time, could you share your roadmap or
>> expectations if you have something in mind?
>>
>> > Thank you, Dongjoon, for initiating this discussion. Let us keep it
>> open. It might take 1-2 weeks to collect from the community all the
>> features we plan to build and ship in 3.2 since we just finished the 3.1
>> voting.
>> > TBH, cutting the branch this April does not look good to me. That
>> means we only have one month left for feature development of Spark 3.2. Do
>> we have enough features in the current master branch? If not, are we able
>> to finish major features we collected here? Do they have a timeline or
>> project plan?
>>
>> Bests,
>> Dongjoon.
>>
>>
>>
>> On Wed, Mar 3, 2021 at 2:58 PM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, John.
>>>
>>> This thread aims to share your expectations and goals (and maybe work
>>> progress) for Apache Spark 3.2 because we are making this together. :)
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Wed, Mar 3, 2021 at 1:59 PM John Zhuge  wrote:
>>>
 Hi Dongjoon,

 Is it possible to get ViewCatalog in? The community already had fairly
 detailed discussions.

 Thanks,
 John

 On Thu, Feb 25, 2021 at 8:57 AM Dongjoon Hyun 
 wrote:

> Hi, All.
>
> Since we have been preparing Apache Spark 3.2.0 in master branch since
> December 2020, March seems to be a good time to share our thoughts and
> aspirations on Apache Spark 3.2.
>
> According to the progress on Apache Spark 3.1 release, Apache Spark
> 3.2 seems to be the last minor release of this year. Given the timeframe,
> we might consider the following. (This is a small set. Please add your
> thoughts to this limited list.)
>
> # Languages
>
> - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but
> slipped out. Currently, we are trying to use Scala 2.13.5 via SPARK-34505
> and investigating the publishing issue. Thank you for your contributions
> and feedback on this.
>
> - Java 17 LTS Support: Java 17 LTS will arrive in September 2021. Like
> Java 11, we need lots of support from our dependencies. Let's see.
>
> - Python 3.6 Deprecation(?): Python 3.6 community support ends at
> 2021-12-23. So, the deprecation is not required yet, but we had better
> prepare it because we don't have an ETA of Apache Spark 3.3 in 2022.
>
> - SparkR CRAN publishing: As we know, it's discontinued so far.
> Resuming it depends on the success of Apache SparkR 3.1.1 CRAN publishing.
> If we succeed in reviving it, we can keep publishing. Otherwise, I believe
> we had better drop it from the releasing work item list officially.
>
> # Dependencies
>
> - Apache Hadoop 3.3.2: Hadoop 3.2.0 becomes the default Hadoop profile
> in Apache Spark 3.1. Currently, Spark master branch lives on Hadoop 
> 3.2.2's
> shaded clients via SPARK-33212. So far, there is one on-going report at
> YARN environment. We hope it will be fixed soon at Spark 3.2 timeframe and
> we can move toward Hadoop 3.3.2.
>
> - Apache Hive 2.3.9: Spark 3.0 starts to use Hive 2.3.7 by default
> instead of old Hive 1.2 fork. Spark 3.1 removed hive-1.2 profile 
> completely
> via SPARK-32981 and replaced the generated hive-service-rpc code with the
> official dependency via SPARK-32981. We are steadily improving this area
> and will consume Hive 2.3.9 if available.
>
> - K8s Client 4.13.2: During K8s GA activity, Spark 3.1 upgrades K8s
> client dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in order to
> support K8s model 1.19.
>
> - Kafka Client 2.8: To bring the client fixes, Spark 3.1 is using
> Kafka Client 2.6. For Spark 3.2, SPARK-33913 upgraded to Kafka 2.7 with
> Scala 2.12.13, but it was reverted later due to Scala 2.12.13 issue. Since
> KAFKA-12357 fixed the Scala requirement two days ago, Spark 3.2 will go
> with Kafka Client 2.8 hopefully.

Re: Apache Spark 3.2 Expectation

2021-03-10 Thread Xiao Li
Below are some nice-to-have features we can work on in Spark 3.2:
Lateral Join support, interval data type, timestamp without time zone,
un-nesting arbitrary queries, the returned metrics of DSV2, and error
message standardization. I believe Spark 3.2 will be another exciting
release!

Go Spark!

Xiao
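
For readers tracking the lateral join item above, the rough SQL shape it
enables is sketched below. This is a hedged illustration only: it assumes a
SparkSession named spark, the view names are made up, and the exact syntax is
whatever the lateral join work finally ships.

  // Correlated subquery in the FROM clause; the subquery may reference c.id.
  spark.sql("CREATE OR REPLACE TEMP VIEW customers AS " +
    "SELECT * FROM VALUES (1, 'a'), (2, 'b') AS t(id, name)")
  spark.sql("CREATE OR REPLACE TEMP VIEW orders AS " +
    "SELECT * FROM VALUES (1, 10.0), (1, 20.0), (2, 5.0) AS t(cust_id, amount)")
  spark.sql("""
    SELECT c.name, o.amount
    FROM customers c,
    LATERAL (SELECT amount FROM orders WHERE cust_id = c.id) o
  """).show()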




Dongjoon Hyun wrote on Wed, Mar 10, 2021 at 12:25 PM:

> Hi, Xiao.
>
> This thread started 13 days ago. Since you asked the community about major
> features or timelines at that time, could you share your roadmap or
> expectations if you have something in mind?
>
> > Thank you, Dongjoon, for initiating this discussion. Let us keep it
> open. It might take 1-2 weeks to collect from the community all the
> features we plan to build and ship in 3.2 since we just finished the 3.1
> voting.
> > TBH, cutting the branch this April does not look good to me. That means
> we only have one month left for feature development of Spark 3.2. Do we
> have enough features in the current master branch? If not, are we able to
> finish major features we collected here? Do they have a timeline or project
> plan?
>
> Bests,
> Dongjoon.
>
>
>
> On Wed, Mar 3, 2021 at 2:58 PM Dongjoon Hyun 
> wrote:
>
>> Hi, John.
>>
>> This thread aims to share your expectations and goals (and maybe work
>> progress) for Apache Spark 3.2 because we are making this together. :)
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Wed, Mar 3, 2021 at 1:59 PM John Zhuge  wrote:
>>
>>> Hi Dongjoon,
>>>
>>> Is it possible to get ViewCatalog in? The community already had fairly
>>> detailed discussions.
>>>
>>> Thanks,
>>> John
>>>
>>> On Thu, Feb 25, 2021 at 8:57 AM Dongjoon Hyun 
>>> wrote:
>>>
 Hi, All.

 Since we have been preparing Apache Spark 3.2.0 in master branch since
 December 2020, March seems to be a good time to share our thoughts and
 aspirations on Apache Spark 3.2.

 According to the progress on Apache Spark 3.1 release, Apache Spark 3.2
 seems to be the last minor release of this year. Given the timeframe, we
 might consider the following. (This is a small set. Please add your
 thoughts to this limited list.)

 # Languages

 - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but
 slipped out. Currently, we are trying to use Scala 2.13.5 via SPARK-34505
 and investigating the publishing issue. Thank you for your contributions
 and feedback on this.

 - Java 17 LTS Support: Java 17 LTS will arrive in September 2021. Like
 Java 11, we need lots of support from our dependencies. Let's see.

 - Python 3.6 Deprecation(?): Python 3.6 community support ends at
 2021-12-23. So, the deprecation is not required yet, but we had better
 prepare it because we don't have an ETA of Apache Spark 3.3 in 2022.

 - SparkR CRAN publishing: As we know, it's discontinued so far.
 Resuming it depends on the success of Apache SparkR 3.1.1 CRAN publishing.
 If we succeed in reviving it, we can keep publishing. Otherwise, I believe
 we had better drop it from the releasing work item list officially.

 # Dependencies

 - Apache Hadoop 3.3.2: Hadoop 3.2.0 becomes the default Hadoop profile
 in Apache Spark 3.1. Currently, Spark master branch lives on Hadoop 3.2.2's
 shaded clients via SPARK-33212. So far, there is one on-going report at
 YARN environment. We hope it will be fixed soon at Spark 3.2 timeframe and
 we can move toward Hadoop 3.3.2.

 - Apache Hive 2.3.9: Spark 3.0 starts to use Hive 2.3.7 by default
 instead of old Hive 1.2 fork. Spark 3.1 removed hive-1.2 profile completely
 via SPARK-32981 and replaced the generated hive-service-rpc code with the
 official dependency via SPARK-32981. We are steadily improving this area
 and will consume Hive 2.3.9 if available.

 - K8s Client 4.13.2: During K8s GA activity, Spark 3.1 upgrades K8s
 client dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in order to
 support K8s model 1.19.

 - Kafka Client 2.8: To bring the client fixes, Spark 3.1 is using Kafka
 Client 2.6. For Spark 3.2, SPARK-33913 upgraded to Kafka 2.7 with Scala
 2.12.13, but it was reverted later due to Scala 2.12.13 issue. Since
 KAFKA-12357 fixed the Scala requirement two days ago, Spark 3.2 will go
 with Kafka Client 2.8 hopefully.

 # Some Features

 - Data Source v2: Spark 3.2 will deliver much richer DSv2 with Apache
 Iceberg integration. Especially, we hope the on-going function catalog SPIP
 and up-coming storage partitioned join SPIP can be delivered as a part of
 Spark 3.2 and become an additional foundation.

 - Columnar Encryption: As of today, Apache Spark master branch supports
 columnar encryption via Apache ORC 1.6 and it's documented via SPARK-34036.
 Also, upcoming Apache Parquet 1.12 has a similar capability. Hopefully,
 

Re: Apache Spark 3.2 Expectation

2021-03-10 Thread Dongjoon Hyun
Hi, Xiao.

This thread started 13 days ago. Since you asked the community about major
features or timelines at that time, could you share your roadmap or
expectations if you have something in mind?

> Thank you, Dongjoon, for initiating this discussion. Let us keep it open.
It might take 1-2 weeks to collect from the community all the features
we plan to build and ship in 3.2 since we just finished the 3.1 voting.
> TBH, cutting the branch this April does not look good to me. That means
we only have one month left for feature development of Spark 3.2. Do we
have enough features in the current master branch? If not, are we able to
finish major features we collected here? Do they have a timeline or project
plan?

Bests,
Dongjoon.



On Wed, Mar 3, 2021 at 2:58 PM Dongjoon Hyun 
wrote:

> Hi, John.
>
> This thread aims to share your expectations and goals (and maybe work
> progress) for Apache Spark 3.2 because we are making this together. :)
>
> Bests,
> Dongjoon.
>
>
> On Wed, Mar 3, 2021 at 1:59 PM John Zhuge  wrote:
>
>> Hi Dongjoon,
>>
>> Is it possible to get ViewCatalog in? The community already had fairly
>> detailed discussions.
>>
>> Thanks,
>> John
>>
>> On Thu, Feb 25, 2021 at 8:57 AM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, All.
>>>
>>> Since we have been preparing Apache Spark 3.2.0 in master branch since
>>> December 2020, March seems to be a good time to share our thoughts and
>>> aspirations on Apache Spark 3.2.
>>>
>>> According to the progress on Apache Spark 3.1 release, Apache Spark 3.2
>>> seems to be the last minor release of this year. Given the timeframe, we
>>> might consider the following. (This is a small set. Please add your
>>> thoughts to this limited list.)
>>>
>>> # Languages
>>>
>>> - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but
>>> slipped out. Currently, we are trying to use Scala 2.13.5 via SPARK-34505
>>> and investigating the publishing issue. Thank you for your contributions
>>> and feedback on this.
>>>
>>> - Java 17 LTS Support: Java 17 LTS will arrive in September 2021. Like
>>> Java 11, we need lots of support from our dependencies. Let's see.
>>>
>>> - Python 3.6 Deprecation(?): Python 3.6 community support ends at
>>> 2021-12-23. So, the deprecation is not required yet, but we had better
>>> prepare it because we don't have an ETA of Apache Spark 3.3 in 2022.
>>>
>>> - SparkR CRAN publishing: As we know, it's discontinued so far. Resuming
>>> it depends on the success of Apache SparkR 3.1.1 CRAN publishing. If we
>>> succeed in reviving it, we can keep publishing. Otherwise, I believe we had
>>> better drop it from the releasing work item list officially.
>>>
>>> # Dependencies
>>>
>>> - Apache Hadoop 3.3.2: Hadoop 3.2.0 becomes the default Hadoop profile
>>> in Apache Spark 3.1. Currently, Spark master branch lives on Hadoop 3.2.2's
>>> shaded clients via SPARK-33212. So far, there is one on-going report at
>>> YARN environment. We hope it will be fixed soon at Spark 3.2 timeframe and
>>> we can move toward Hadoop 3.3.2.
>>>
>>> - Apache Hive 2.3.9: Spark 3.0 starts to use Hive 2.3.7 by default
>>> instead of old Hive 1.2 fork. Spark 3.1 removed hive-1.2 profile completely
>>> via SPARK-32981 and replaced the generated hive-service-rpc code with the
>>> official dependency via SPARK-32981. We are steadily improving this area
>>> and will consume Hive 2.3.9 if available.
>>>
>>> - K8s Client 4.13.2: During K8s GA activity, Spark 3.1 upgrades K8s
>>> client dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in order to
>>> support K8s model 1.19.
>>>
>>> - Kafka Client 2.8: To bring the client fixes, Spark 3.1 is using Kafka
>>> Client 2.6. For Spark 3.2, SPARK-33913 upgraded to Kafka 2.7 with Scala
>>> 2.12.13, but it was reverted later due to Scala 2.12.13 issue. Since
>>> KAFKA-12357 fixed the Scala requirement two days ago, Spark 3.2 will go
>>> with Kafka Client 2.8 hopefully.
>>>
>>> # Some Features
>>>
>>> - Data Source v2: Spark 3.2 will deliver much richer DSv2 with Apache
>>> Iceberg integration. Especially, we hope the on-going function catalog SPIP
>>> and up-coming storage partitioned join SPIP can be delivered as a part of
>>> Spark 3.2 and become an additional foundation.
>>>
>>> - Columnar Encryption: As of today, Apache Spark master branch supports
>>> columnar encryption via Apache ORC 1.6 and it's documented via SPARK-34036.
>>> Also, upcoming Apache Parquet 1.12 has a similar capability. Hopefully,
>>> Apache Spark 3.2 is going to be the first release to have this feature
>>> officially. Any feedback is welcome.
>>>
>>> - Improved ZStandard Support: Spark 3.2 will bring more benefits for
>>> ZStandard users: 1) SPARK-34340 added native ZSTD JNI buffer pool support
>>> for all IO operations, 2) SPARK-33978 makes ORC datasource support ZSTD
>>> compression, 3) SPARK-34503 sets ZSTD as the default codec for event log
>>> compression, 4) SPARK-34479 aims to support ZSTD at Avro data source. Also,
>>> the upcoming Parquet 1.12 

Re: Apache Spark 3.2 Expectation

2021-03-03 Thread Dongjoon Hyun
Hi, John.

This thread aims to share your expectations and goals (and maybe work
progress) for Apache Spark 3.2 because we are making this together. :)

Bests,
Dongjoon.


On Wed, Mar 3, 2021 at 1:59 PM John Zhuge  wrote:

> Hi Dongjoon,
>
> Is it possible to get ViewCatalog in? The community already had fairly
> detailed discussions.
>
> Thanks,
> John
>
> On Thu, Feb 25, 2021 at 8:57 AM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> Since we have been preparing Apache Spark 3.2.0 in master branch since
>> December 2020, March seems to be a good time to share our thoughts and
>> aspirations on Apache Spark 3.2.
>>
>> According to the progress on Apache Spark 3.1 release, Apache Spark 3.2
>> seems to be the last minor release of this year. Given the timeframe, we
>> might consider the following. (This is a small set. Please add your
>> thoughts to this limited list.)
>>
>> # Languages
>>
>> - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but
>> slipped out. Currently, we are trying to use Scala 2.13.5 via SPARK-34505
>> and investigating the publishing issue. Thank you for your contributions
>> and feedback on this.
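
(For downstream builds, the practical effect of Scala 2.13 support is a second
set of cross-published artifacts. A minimal build.sbt sketch with illustrative
version numbers, assuming 3.2.0 eventually ships artifacts for both Scala
versions:)

  // sbt's %% operator resolves the _2.12 or _2.13 artifact from scalaVersion.
  ThisBuild / scalaVersion := "2.13.5"
  libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.2.0" % "provided"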
>>
>> - Java 17 LTS Support: Java 17 LTS will arrive in September 2021. Like
>> Java 11, we need lots of support from our dependencies. Let's see.
>>
>> - Python 3.6 Deprecation(?): Python 3.6 community support ends at
>> 2021-12-23. So, the deprecation is not required yet, but we had better
>> prepare it because we don't have an ETA of Apache Spark 3.3 in 2022.
>>
>> - SparkR CRAN publishing: As we know, it's discontinued so far. Resuming
>> it depends on the success of Apache SparkR 3.1.1 CRAN publishing. If we
>> succeed in reviving it, we can keep publishing. Otherwise, I believe we had
>> better drop it from the releasing work item list officially.
>>
>> # Dependencies
>>
>> - Apache Hadoop 3.3.2: Hadoop 3.2.0 becomes the default Hadoop profile in
>> Apache Spark 3.1. Currently, Spark master branch lives on Hadoop 3.2.2's
>> shaded clients via SPARK-33212. So far, there is one on-going report at
>> YARN environment. We hope it will be fixed soon at Spark 3.2 timeframe and
>> we can move toward Hadoop 3.3.2.
>>
>> - Apache Hive 2.3.9: Spark 3.0 starts to use Hive 2.3.7 by default
>> instead of old Hive 1.2 fork. Spark 3.1 removed hive-1.2 profile completely
>> via SPARK-32981 and replaced the generated hive-service-rpc code with the
>> official dependency via SPARK-32981. We are steadily improving this area
>> and will consume Hive 2.3.9 if available.
>>
>> - K8s Client 4.13.2: During K8s GA activity, Spark 3.1 upgrades K8s
>> client dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in order to
>> support K8s model 1.19.
>>
>> - Kafka Client 2.8: To bring the client fixes, Spark 3.1 is using Kafka
>> Client 2.6. For Spark 3.2, SPARK-33913 upgraded to Kafka 2.7 with Scala
>> 2.12.13, but it was reverted later due to Scala 2.12.13 issue. Since
>> KAFKA-12357 fixed the Scala requirement two days ago, Spark 3.2 will go
>> with Kafka Client 2.8 hopefully.
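
(The client-library bump is transparent at the DataFrame level; the Kafka
source keeps the shape sketched below. Broker address and topic name are
placeholders, and the external spark-sql-kafka-0-10 package must be on the
classpath:)

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("kafka-smoke-test").getOrCreate()
  spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092") // placeholder address
    .option("subscribe", "events")                      // placeholder topic
    .load()
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .writeStream
    .format("console")
    .start()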
>>
>> # Some Features
>>
>> - Data Source v2: Spark 3.2 will deliver much richer DSv2 with Apache
>> Iceberg integration. Especially, we hope the on-going function catalog SPIP
>> and up-coming storage partitioned join SPIP can be delivered as a part of
>> Spark 3.2 and become an additional foundation.
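
(For orientation, DSv2 catalogs plug in purely through configuration. A
hedged sketch of registering an external catalog such as Iceberg's, with the
class and option names taken from Iceberg's documentation and a placeholder
warehouse path:)

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
  // Tables under the "demo" catalog become addressable from SQL.
  spark.sql("CREATE TABLE demo.db.t (id BIGINT) USING iceberg")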
>>
>> - Columnar Encryption: As of today, Apache Spark master branch supports
>> columnar encryption via Apache ORC 1.6 and it's documented via SPARK-34036.
>> Also, upcoming Apache Parquet 1.12 has a similar capability. Hopefully,
>> Apache Spark 3.2 is going to be the first release to have this feature
>> officially. Any feedback is welcome.
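
(A hedged sketch of what the writer side looks like; the option names pass
through to ORC as in the SPARK-34036 documentation, while the KMS path, key
name, and column below are placeholders. Assumes a SparkSession named spark:)

  val df = spark.range(10).selectExpr("id", "CAST(id AS STRING) AS ssn")
  df.write
    .option("hadoop.security.key.provider.path", "kms://host@9600/kms") // placeholder
    .option("orc.key.provider", "hadoop")
    .option("orc.encrypt", "pii:ssn")   // encrypt column ssn with key "pii"
    .option("orc.mask", "nullify:ssn")  // readers without the key see NULLs
    .orc("/tmp/orc-encrypted")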
>>
>> - Improved ZStandard Support: Spark 3.2 will bring more benefits for
>> ZStandard users: 1) SPARK-34340 added native ZSTD JNI buffer pool support
>> for all IO operations, 2) SPARK-33978 makes ORC datasource support ZSTD
>> compression, 3) SPARK-34503 sets ZSTD as the default codec for event log
>> compression, 4) SPARK-34479 aims to support ZSTD at Avro data source. Also,
>> the upcoming Parquet 1.12 supports ZSTD (and supports JNI buffer pool),
>> too. I'm expecting more benefits.
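
(To make the list above concrete, a sketch of the user-facing knobs. Config
and option names follow the JIRAs cited; paths are placeholders; the Avro line
assumes the external spark-avro module is on the classpath:)

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.compress", "true")
    .config("spark.eventLog.compression.codec", "zstd") // SPARK-34503 default
    .getOrCreate()
  val df = spark.range(1000).toDF("id")
  df.write.option("compression", "zstd").orc("/tmp/zstd-orc")  // SPARK-33978
  df.write.format("avro").option("compression", "zstd")
    .save("/tmp/zstd-avro")                                    // SPARK-34479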
>>
>> - Structured Streaming with RocksDB backend: According to the latest
>> update, it looks active enough for merging to master branch in Spark 3.2.
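
(Once merged, selecting the RocksDB backend should be a one-line state store
config; the provider class name below is quoted from the in-progress work and
is an assumption until it lands:)

  val spark = org.apache.spark.sql.SparkSession.builder()
    .config("spark.sql.streaming.stateStore.providerClass",
      "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
    .getOrCreate()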
>>
>> Please share your thoughts and let's build better Apache Spark 3.2
>> together.
>>
>> Bests,
>> Dongjoon.
>>
>
>
> --
> John Zhuge
>


Re: Apache Spark 3.2 Expectation

2021-03-03 Thread John Zhuge
Hi Dongjoon,

Is it possible to get ViewCatalog in? The community already had fairly
detailed discussions.

Thanks,
John
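
(Since the ViewCatalog API had not been merged at this point, the sketch below
is purely hypothetical: an illustration of how a view catalog could sit next
to TableCatalog in the DSv2 plugin model, not the actual proposal. Every
member name here is invented.)

  import org.apache.spark.sql.connector.catalog.{CatalogPlugin, Identifier}

  // Hypothetical shape only, for orientation.
  trait ViewCatalog extends CatalogPlugin {
    def listViews(namespace: Array[String]): Array[Identifier]
    def loadView(ident: Identifier): String   // e.g. the view's SQL text
    def createView(ident: Identifier, sql: String): Unit
    def dropView(ident: Identifier): Boolean
  }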

On Thu, Feb 25, 2021 at 8:57 AM Dongjoon Hyun 
wrote:

> Hi, All.
>
> Since we have been preparing Apache Spark 3.2.0 in master branch since
> December 2020, March seems to be a good time to share our thoughts and
> aspirations on Apache Spark 3.2.
>
> According to the progress on Apache Spark 3.1 release, Apache Spark 3.2
> seems to be the last minor release of this year. Given the timeframe, we
> might consider the following. (This is a small set. Please add your
> thoughts to this limited list.)
>
> # Languages
>
> - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but slipped
> out. Currently, we are trying to use Scala 2.13.5 via SPARK-34505 and
> investigating the publishing issue. Thank you for your contributions and
> feedback on this.
>
> - Java 17 LTS Support: Java 17 LTS will arrive in September 2021. Like
> Java 11, we need lots of support from our dependencies. Let's see.
>
> - Python 3.6 Deprecation(?): Python 3.6 community support ends at
> 2021-12-23. So, the deprecation is not required yet, but we had better
> prepare it because we don't have an ETA of Apache Spark 3.3 in 2022.
>
> - SparkR CRAN publishing: As we know, it's discontinued so far. Resuming
> it depends on the success of Apache SparkR 3.1.1 CRAN publishing. If we
> succeed in reviving it, we can keep publishing. Otherwise, I believe we had
> better drop it from the releasing work item list officially.
>
> # Dependencies
>
> - Apache Hadoop 3.3.2: Hadoop 3.2.0 becomes the default Hadoop profile in
> Apache Spark 3.1. Currently, Spark master branch lives on Hadoop 3.2.2's
> shaded clients via SPARK-33212. So far, there is one on-going report at
> YARN environment. We hope it will be fixed soon at Spark 3.2 timeframe and
> we can move toward Hadoop 3.3.2.
>
> - Apache Hive 2.3.9: Spark 3.0 starts to use Hive 2.3.7 by default instead
> of old Hive 1.2 fork. Spark 3.1 removed hive-1.2 profile completely via
> SPARK-32981 and replaced the generated hive-service-rpc code with the
> official dependency via SPARK-32981. We are steadily improving this area
> and will consume Hive 2.3.9 if available.
>
> - K8s Client 4.13.2: During K8s GA activity, Spark 3.1 upgrades K8s client
> dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in order to support
> K8s model 1.19.
>
> - Kafka Client 2.8: To bring the client fixes, Spark 3.1 is using Kafka
> Client 2.6. For Spark 3.2, SPARK-33913 upgraded to Kafka 2.7 with Scala
> 2.12.13, but it was reverted later due to Scala 2.12.13 issue. Since
> KAFKA-12357 fixed the Scala requirement two days ago, Spark 3.2 will go
> with Kafka Client 2.8 hopefully.
>
> # Some Features
>
> - Data Source v2: Spark 3.2 will deliver much richer DSv2 with Apache
> Iceberg integration. Especially, we hope the on-going function catalog SPIP
> and up-coming storage partitioned join SPIP can be delivered as a part of
> Spark 3.2 and become an additional foundation.
>
> - Columnar Encryption: As of today, Apache Spark master branch supports
> columnar encryption via Apache ORC 1.6 and it's documented via SPARK-34036.
> Also, upcoming Apache Parquet 1.12 has a similar capability. Hopefully,
> Apache Spark 3.2 is going to be the first release to have this feature
> officially. Any feedback is welcome.
>
> - Improved ZStandard Support: Spark 3.2 will bring more benefits for
> ZStandard users: 1) SPARK-34340 added native ZSTD JNI buffer pool support
> for all IO operations, 2) SPARK-33978 makes ORC datasource support ZSTD
> compression, 3) SPARK-34503 sets ZSTD as the default codec for event log
> compression, 4) SPARK-34479 aims to support ZSTD at Avro data source. Also,
> the upcoming Parquet 1.12 supports ZSTD (and supports JNI buffer pool),
> too. I'm expecting more benefits.
>
> - Structured Streaming with RocksDB backend: According to the latest
> update, it looks active enough for merging to master branch in Spark 3.2.
>
> Please share your thoughts and let's build better Apache Spark 3.2
> together.
>
> Bests,
> Dongjoon.
>


-- 
John Zhuge


Re: Apache Spark 3.2 Expectation

2021-03-03 Thread Chang Chen
+1 for Data Source V2 Aggregate push down

huaxin gao wrote on Sat, Feb 27, 2021 at 4:20 AM:

> Thanks Dongjoon and Xiao for the discussion. I would like to add Data
> Source V2 Aggregate push down to the list. I am currently working on
> JDBC Data Source V2 Aggregate push down, but the common code can be used
> for the file based V2 Data Source as well. For example, MAX and MIN can be
> pushed down to Parquet and Orc, since they can use statistics information
> to perform these operations efficiently. Quite a few users are
> interested in this Aggregate push down feature and the preliminary
> performance test for JDBC Aggregate push down is positive. So I think it is
> a valuable feature to add for Spark 3.2.
>
> Thanks,
> Huaxin
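
(A hedged sketch of the behavior described above; the pushDownAggregate option
name follows the in-progress JDBC DSv2 work, and the URL and table are
placeholders. Assumes a SparkSession named spark and a JDBC driver on the
classpath:)

  import org.apache.spark.sql.functions.{max, min}

  val orders = spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/sales") // placeholder
    .option("dbtable", "orders")                           // placeholder
    .option("pushDownAggregate", "true")
    .load()
  // When the source supports push down, MAX/MIN appear in the pushed scan
  // (visible via explain()) instead of being computed inside Spark.
  orders.agg(max("amount"), min("amount")).explain()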
>
> On Fri, Feb 26, 2021 at 11:13 AM Xiao Li  wrote:
>
>> Thank you, Dongjoon, for initiating this discussion. Let us keep it open.
>> It might take 1-2 weeks to collect from the community all the features
>> we plan to build and ship in 3.2 since we just finished the 3.1 voting.
>>
>>
>>> 3. +100 for Apache Spark 3.2.0 in July 2021. Maybe we need the `branch-cut`
>>> in April because we took 3 months for the Spark 3.1 release.
>>
>>
>> TBH, cutting the branch this April does not look good to me. That means
>> we only have one month left for feature development of Spark 3.2. Do we
>> have enough features in the current master branch? If not, are we able to
>> finish major features we collected here? Do they have a timeline or project
>> plan?
>>
>> Xiao
>>
>> Dongjoon Hyun wrote on Fri, Feb 26, 2021 at 10:07 AM:
>>
>>> Thank you, Mridul and Sean.
>>>
>>> 1. Yes, `2017` was a typo. Java 17 is scheduled for September 2021. And, of
>>> course, it's a nice-to-have status. :)
>>>
>>> 2. `Push based shuffle and disaggregated shuffle`. Definitely. Thanks
>>> for sharing.
>>>
>>> 3. +100 for Apache Spark 3.2.0 in July 2021. Maybe we need the `branch-cut`
>>> in April because we took 3 months for the Spark 3.1 release.
>>> Let's update our release roadmap of the Apache Spark website.
>>>
>>> > I'd roughly expect 3.2 in, say, July of this year, given the usual
>>> cadence. No reason it couldn't be a little sooner or later. There is
>>> already some good stuff in 3.2 and will be a good minor release in 5-6
>>> months.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>>
>>> On Thu, Feb 25, 2021 at 9:33 AM Sean Owen  wrote:
>>>
 I'd roughly expect 3.2 in, say, July of this year, given the usual
 cadence. No reason it couldn't be a little sooner or later. There is
 already some good stuff in 3.2 and will be a good minor release in 5-6
 months.

 On Thu, Feb 25, 2021 at 10:57 AM Dongjoon Hyun 
 wrote:

> Hi, All.
>
> Since we have been preparing Apache Spark 3.2.0 in master branch since
> December 2020, March seems to be a good time to share our thoughts and
> aspirations on Apache Spark 3.2.
>
> According to the progress on Apache Spark 3.1 release, Apache Spark
> 3.2 seems to be the last minor release of this year. Given the timeframe,
> we might consider the following. (This is a small set. Please add your
> thoughts to this limited list.)
>
> # Languages
>
> - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but
> slipped out. Currently, we are trying to use Scala 2.13.5 via SPARK-34505
> and investigating the publishing issue. Thank you for your contributions
> and feedback on this.
>
> - Java 17 LTS Support: Java 17 LTS will arrive in September 2021. Like
> Java 11, we need lots of support from our dependencies. Let's see.
>
> - Python 3.6 Deprecation(?): Python 3.6 community support ends at
> 2021-12-23. So, the deprecation is not required yet, but we had better
> prepare it because we don't have an ETA of Apache Spark 3.3 in 2022.
>
> - SparkR CRAN publishing: As we know, it's discontinued so far.
> Resuming it depends on the success of Apache SparkR 3.1.1 CRAN publishing.
> If we succeed in reviving it, we can keep publishing. Otherwise, I believe
> we had better drop it from the releasing work item list officially.
>
> # Dependencies
>
> - Apache Hadoop 3.3.2: Hadoop 3.2.0 becomes the default Hadoop profile
> in Apache Spark 3.1. Currently, Spark master branch lives on Hadoop 
> 3.2.2's
> shaded clients via SPARK-33212. So far, there is one on-going report at
> YARN environment. We hope it will be fixed soon at Spark 3.2 timeframe and
> we can move toward Hadoop 3.3.2.
>
> - Apache Hive 2.3.9: Spark 3.0 starts to use Hive 2.3.7 by default
> instead of old Hive 1.2 fork. Spark 3.1 removed hive-1.2 profile 
> completely
> via SPARK-32981 and replaced the generated hive-service-rpc code with the
> official dependency via SPARK-32981. We are steadily improving this area
> and will consume Hive 2.3.9 if available.
>
> - K8s Client 4.13.2: During K8s GA activity, Spark 3.1 upgrades K8s
> client dependency to 

Re: Apache Spark 3.2 Expectation

2021-02-28 Thread bo yang
+1 for better support for disaggregated shuffle (push-based shuffle is a
great example; there are also the Facebook shuffle service and the Uber
remote shuffle service). There were previously some community sync-up
meetings on this, but they were discontinued. Are people interested in
continuing the sync-up meetings on this?
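
(For context, the client-side switch for push-based shuffle as it was shaping
up for YARN; config names are from the SPARK-30602 line of work and assume the
external shuffle service is running on the node managers:)

  val spark = org.apache.spark.sql.SparkSession.builder()
    .config("spark.shuffle.service.enabled", "true")
    .config("spark.shuffle.push.enabled", "true")
    .getOrCreate()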

On Fri, Feb 26, 2021 at 6:41 PM Yi Wu  wrote:

> +1 to continue the incomplete push-based shuffle work.
>
> --
> Yi
>
> On Fri, Feb 26, 2021 at 1:26 AM Mridul Muralidharan 
> wrote:
>
>>
>>
>> Nit: Java 17 -> should be available by Sept 2021 :-)
>> Adoption would also depend on some of our nontrivial dependencies
>> supporting it - it might be a stretch to get it in for Apache Spark 3.2?
>>
>> Features:
>> Push based shuffle and disaggregated shuffle should also be in 3.2
>>
>>
>> Regards,
>> Mridul
>>
>>
>>
>>
>>
>>
>> On Thu, Feb 25, 2021 at 10:57 AM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, All.
>>>
>>> Since we have been preparing Apache Spark 3.2.0 in master branch since
>>> December 2020, March seems to be a good time to share our thoughts and
>>> aspirations on Apache Spark 3.2.
>>>
>>> According to the progress on Apache Spark 3.1 release, Apache Spark 3.2
>>> seems to be the last minor release of this year. Given the timeframe, we
>>> might consider the following. (This is a small set. Please add your
>>> thoughts to this limited list.)
>>>
>>> # Languages
>>>
>>> - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but
>>> slipped out. Currently, we are trying to use Scala 2.13.5 via SPARK-34505
>>> and investigating the publishing issue. Thank you for your contributions
>>> and feedback on this.
>>>
>>> - Java 17 LTS Support: Java 17 LTS will arrive in September 2021. Like
>>> Java 11, we need lots of support from our dependencies. Let's see.
>>>
>>> - Python 3.6 Deprecation(?): Python 3.6 community support ends at
>>> 2021-12-23. So, the deprecation is not required yet, but we had better
>>> prepare it because we don't have an ETA of Apache Spark 3.3 in 2022.
>>>
>>> - SparkR CRAN publishing: As we know, it's discontinued so far. Resuming
>>> it depends on the success of Apache SparkR 3.1.1 CRAN publishing. If we
>>> succeed in reviving it, we can keep publishing. Otherwise, I believe we had
>>> better drop it from the releasing work item list officially.
>>>
>>> # Dependencies
>>>
>>> - Apache Hadoop 3.3.2: Hadoop 3.2.0 becomes the default Hadoop profile
>>> in Apache Spark 3.1. Currently, Spark master branch lives on Hadoop 3.2.2's
>>> shaded clients via SPARK-33212. So far, there is one on-going report at
>>> YARN environment. We hope it will be fixed soon at Spark 3.2 timeframe and
>>> we can move toward Hadoop 3.3.2.
>>>
>>> - Apache Hive 2.3.9: Spark 3.0 starts to use Hive 2.3.7 by default
>>> instead of old Hive 1.2 fork. Spark 3.1 removed hive-1.2 profile completely
>>> via SPARK-32981 and replaced the generated hive-service-rpc code with the
>>> official dependency via SPARK-32981. We are steadily improving this area
>>> and will consume Hive 2.3.9 if available.
>>>
>>> - K8s Client 4.13.2: During K8s GA activity, Spark 3.1 upgrades K8s
>>> client dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in order to
>>> support K8s model 1.19.
>>>
>>> - Kafka Client 2.8: To bring the client fixes, Spark 3.1 is using Kafka
>>> Client 2.6. For Spark 3.2, SPARK-33913 upgraded to Kafka 2.7 with Scala
>>> 2.12.13, but it was reverted later due to Scala 2.12.13 issue. Since
>>> KAFKA-12357 fixed the Scala requirement two days ago, Spark 3.2 will go
>>> with Kafka Client 2.8 hopefully.
>>>
>>> # Some Features
>>>
>>> - Data Source v2: Spark 3.2 will deliver much richer DSv2 with Apache
>>> Iceberg integration. Especially, we hope the on-going function catalog SPIP
>>> and up-coming storage partitioned join SPIP can be delivered as a part of
>>> Spark 3.2 and become an additional foundation.
>>>
>>> - Columnar Encryption: As of today, Apache Spark master branch supports
>>> columnar encryption via Apache ORC 1.6 and it's documented via SPARK-34036.
>>> Also, upcoming Apache Parquet 1.12 has a similar capability. Hopefully,
>>> Apache Spark 3.2 is going to be the first release to have this feature
>>> officially. Any feedback is welcome.
>>>
>>> - Improved ZStandard Support: Spark 3.2 will bring more benefits for
>>> ZStandard users: 1) SPARK-34340 added native ZSTD JNI buffer pool support
>>> for all IO operations, 2) SPARK-33978 makes ORC datasource support ZSTD
>>> compression, 3) SPARK-34503 sets ZSTD as the default codec for event log
>>> compression, 4) SPARK-34479 aims to support ZSTD at Avro data source. Also,
>>> the upcoming Parquet 1.12 supports ZSTD (and supports JNI buffer pool),
>>> too. I'm expecting more benefits.
>>>
>>> - Structured Streaming with RocksDB backend: According to the latest
>>> update, it looks active enough for merging to master 

Re: Apache Spark 3.2 Expectation

2021-02-28 Thread Takeshi Yamamuro
Thanks, Dongjoon, for the discussion.
I would like to add Gengliang's work: SPARK-34246, new type coercion syntax
rules in ANSI mode.
I think it is worth describing in the next release note, too.

Bests,
Takeshi
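
(The flag that gates these coercion rules is spark.sql.ansi.enabled. The cast
below illustrates ANSI mode generally rather than SPARK-34246's specific
rules; it assumes a running SparkSession named spark:)

  spark.conf.set("spark.sql.ansi.enabled", "true")
  // Under ANSI mode an invalid string-to-int cast raises an error:
  // spark.sql("SELECT CAST('abc' AS INT)").show()  // fails
  spark.conf.set("spark.sql.ansi.enabled", "false")
  spark.sql("SELECT CAST('abc' AS INT)").show()     // returns NULL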

On Sat, Feb 27, 2021 at 11:41 AM Yi Wu  wrote:

> +1 to continue the incomplete push-based shuffle work.
>
> --
> Yi
>
> On Fri, Feb 26, 2021 at 1:26 AM Mridul Muralidharan 
> wrote:
>
>>
>>
>> Nit: Java 17 -> should be available by Sept 2021 :-)
>> Adoption would also depend on some of our nontrivial dependencies
>> supporting it - it might be a stretch to get it in for Apache Spark 3.2?
>>
>> Features:
>> Push based shuffle and disaggregated shuffle should also be in 3.2
>>
>>
>> Regards,
>> Mridul
>>
>>
>>
>>
>>
>>
>> On Thu, Feb 25, 2021 at 10:57 AM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, All.
>>>
>>> Since we have been preparing Apache Spark 3.2.0 in master branch since
>>> December 2020, March seems to be a good time to share our thoughts and
>>> aspirations on Apache Spark 3.2.
>>>
>>> According to the progress on Apache Spark 3.1 release, Apache Spark 3.2
>>> seems to be the last minor release of this year. Given the timeframe, we
>>> might consider the following. (This is a small set. Please add your
>>> thoughts to this limited list.)
>>>
>>> # Languages
>>>
>>> - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but
>>> slipped out. Currently, we are trying to use Scala 2.13.5 via SPARK-34505
>>> and investigating the publishing issue. Thank you for your contributions
>>> and feedback on this.
>>>
>>> - Java 17 LTS Support: Java 17 LTS will arrive in September 2021. Like
>>> Java 11, we need lots of support from our dependencies. Let's see.
>>>
>>> - Python 3.6 Deprecation(?): Python 3.6 community support ends at
>>> 2021-12-23. So, the deprecation is not required yet, but we had better
>>> prepare it because we don't have an ETA of Apache Spark 3.3 in 2022.
>>>
>>> - SparkR CRAN publishing: As we know, it's discontinued so far. Resuming
>>> it depends on the success of Apache SparkR 3.1.1 CRAN publishing. If we
>>> succeed in reviving it, we can keep publishing. Otherwise, I believe we had
>>> better drop it from the releasing work item list officially.
>>>
>>> # Dependencies
>>>
>>> - Apache Hadoop 3.3.2: Hadoop 3.2.0 becomes the default Hadoop profile
>>> in Apache Spark 3.1. Currently, Spark master branch lives on Hadoop 3.2.2's
>>> shaded clients via SPARK-33212. So far, there is one on-going report at
>>> YARN environment. We hope it will be fixed soon at Spark 3.2 timeframe and
>>> we can move toward Hadoop 3.3.2.
>>>
>>> - Apache Hive 2.3.9: Spark 3.0 starts to use Hive 2.3.7 by default
>>> instead of old Hive 1.2 fork. Spark 3.1 removed hive-1.2 profile completely
>>> via SPARK-32981 and replaced the generated hive-service-rpc code with the
>>> official dependency via SPARK-32981. We are steadily improving this area
>>> and will consume Hive 2.3.9 if available.
>>>
>>> - K8s Client 4.13.2: During K8s GA activity, Spark 3.1 upgrades K8s
>>> client dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in order to
>>> support K8s model 1.19.
>>>
>>> - Kafka Client 2.8: To bring the client fixes, Spark 3.1 is using Kafka
>>> Client 2.6. For Spark 3.2, SPARK-33913 upgraded to Kafka 2.7 with Scala
>>> 2.12.13, but it was reverted later due to Scala 2.12.13 issue. Since
>>> KAFKA-12357 fixed the Scala requirement two days ago, Spark 3.2 will go
>>> with Kafka Client 2.8 hopefully.
>>>
>>> # Some Features
>>>
>>> - Data Source v2: Spark 3.2 will deliver much richer DSv2 with Apache
>>> Iceberg integration. Especially, we hope the on-going function catalog SPIP
>>> and up-coming storage partitioned join SPIP can be delivered as a part of
>>> Spark 3.2 and become an additional foundation.
>>>
>>> - Columnar Encryption: As of today, Apache Spark master branch supports
>>> columnar encryption via Apache ORC 1.6 and it's documented via SPARK-34036.
>>> Also, upcoming Apache Parquet 1.12 has a similar capability. Hopefully,
>>> Apache Spark 3.2 is going to be the first release to have this feature
>>> officially. Any feedback is welcome.
>>>
>>> - Improved ZStandard Support: Spark 3.2 will bring more benefits for
>>> ZStandard users: 1) SPARK-34340 added native ZSTD JNI buffer pool support
>>> for all IO operations, 2) SPARK-33978 makes ORC datasource support ZSTD
>>> compression, 3) SPARK-34503 sets ZSTD as the default codec for event log
>>> compression, 4) SPARK-34479 aims to support ZSTD at Avro data source. Also,
>>> the upcoming Parquet 1.12 supports ZSTD (and supports JNI buffer pool),
>>> too. I'm expecting more benefits.
>>>
>>> - Structured Streaming with RocksDB backend: According to the latest
>>> update, it looks active enough for merging to master branch in Spark 3.2.
>>>
>>> Please share your thoughts and let's build better Apache Spark 3.2
>>> together.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>

-- 
---
Takeshi Yamamuro


Re: Apache Spark 3.2 Expectation

2021-02-26 Thread Yi Wu
+1 to continue the incomplete push-based shuffle.

--
Yi

On Fri, Feb 26, 2021 at 1:26 AM Mridul Muralidharan 
wrote:

>
>
> Nit: Java 17 -> should be available by Sept 2021 :-)
> Adoption would also depend on some of our nontrivial dependencies
> supporting it - it might be a stretch to get it in for Apache Spark 3.2?
>
> Features:
> Push based shuffle and disaggregated shuffle should also be in 3.2
>
>
> Regards,
> Mridul
>
>
>
>
>
>
> On Thu, Feb 25, 2021 at 10:57 AM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> Since we have been preparing Apache Spark 3.2.0 in master branch since
>> December 2020, March seems to be a good time to share our thoughts and
>> aspirations on Apache Spark 3.2.
>>
>> According to the progress on Apache Spark 3.1 release, Apache Spark 3.2
>> seems to be the last minor release of this year. Given the timeframe, we
>> might consider the following. (This is a small set. Please add your
>> thoughts to this limited list.)
>>
>> # Languages
>>
>> - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but
>> slipped out. Currently, we are trying to use Scala 2.13.5 via SPARK-34505
>> and investigating the publishing issue. Thank you for your contributions
>> and feedback on this.
>>
>> - Java 17 LTS Support: Java 17 LTS will arrive in September 2017. Like
>> Java 11, we need lots of support from our dependencies. Let's see.
>>
>> - Python 3.6 Deprecation(?): Python 3.6 community support ends on
>> 2021-12-23. So, the deprecation is not required yet, but we had better
>> prepare for it because we don't have an ETA for Apache Spark 3.3 in 2022.
>>
>> - SparkR CRAN publishing: As we know, it's discontinued so far. Resuming
>> it depends on the success of Apache SparkR 3.1.1 CRAN publishing. If that
>> succeeds, we can keep publishing. Otherwise, I believe we had
>> better drop it from the releasing work item list officially.
>>
>> # Dependencies
>>
>> - Apache Hadoop 3.3.2: Hadoop 3.2.0 becomes the default Hadoop profile in
>> Apache Spark 3.1. Currently, Spark master branch lives on Hadoop 3.2.2's
>> shaded clients via SPARK-33212. So far, there is one on-going report at
>> YARN environment. We hope it will be fixed soon at Spark 3.2 timeframe and
>> we can move toward Hadoop 3.3.2.
>>
>> - Apache Hive 2.3.9: Spark 3.0 starts to use Hive 2.3.7 by default
>> instead of old Hive 1.2 fork. Spark 3.1 removed hive-1.2 profile completely
>> via SPARK-32981 and replaced the generated hive-service-rpc code with the
>> official dependency via SPARK-32981. We are steadily improving this area
>> and will consume Hive 2.3.9 if available.
>>
>> - K8s Client 4.13.2: During K8s GA activity, Spark 3.1 upgrades K8s
>> client dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in order to
>> support K8s model 1.19.
>>
>> - Kafka Client 2.8: To bring the client fixes, Spark 3.1 is using Kafka
>> Client 2.6. For Spark 3.2, SPARK-33913 upgraded to Kafka 2.7 with Scala
>> 2.12.13, but it was reverted later due to Scala 2.12.13 issue. Since
>> KAFKA-12357 fixed the Scala requirement two days ago, Spark 3.2 will go
>> with Kafka Client 2.8 hopefully.
>>
>> # Some Features
>>
>> - Data Source v2: Spark 3.2 will deliver much richer DSv2 with Apache
>> Iceberg integration. Especially, we hope the on-going function catalog SPIP
>> and up-coming storage partitioned join SPIP can be delivered as a part of
>> Spark 3.2 and become an additional foundation.
>>
>> - Columnar Encryption: As of today, Apache Spark master branch supports
>> columnar encryption via Apache ORC 1.6 and it's documented via SPARK-34036.
>> Also, upcoming Apache Parquet 1.12 has a similar capability. Hopefully,
>> Apache Spark 3.2 is going to be the first release to have this feature
>> officially. Any feedback is welcome.
>>
>> - Improved ZStandard Support: Spark 3.2 will bring more benefits for
>> ZStandard users: 1) SPARK-34340 added native ZSTD JNI buffer pool support
>> for all IO operations, 2) SPARK-33978 makes ORC datasource support ZSTD
>> compression, 3) SPARK-34503 sets ZSTD as the default codec for event log
>> compression, 4) SPARK-34479 aims to support ZSTD at Avro data source. Also,
>> the upcoming Parquet 1.12 supports ZSTD (and supports JNI buffer pool),
>> too. I'm expecting more benefits.
>>
>> - Structured Streaming with RocksDB backend: According to the latest
>> update, it looks active enough for merging to master branch in Spark 3.2.
>>
>> Please share your thoughts and let's build better Apache Spark 3.2
>> together.
>>
>> Bests,
>> Dongjoon.
>>
>


Re: Apache Spark 3.2 Expectation

2021-02-26 Thread Cheng Su
Hi,

Just want to share some things I am working on for 3.2, in case they matter.


  * Shuffled hash join improvement (SPARK-32461)
    - This is one of the release-note JIRAs in 3.1; the major things left are
      the sort-based fallback and code-gen for FULL OUTER join.
  * Join and aggregation code-gen (SPARK-34287 and more to create)
    - Add code-gen for all join types of sort merge join, object hash
      aggregation, and sort aggregation.
  * Write Hive/Presto-compatible bucketed tables (SPARK-19256)
    - This is a long-standing issue, and we made progress on the plan during
      3.1 development. We ideally want to finish the feature in 3.2.

We have already developed most of these features internally and rolled them
out to production.
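
For context, a minimal sketch of exercising shuffled hash join today via the
join hint that exists since Spark 3.0 (the DataFrames below are hypothetical;
the code-gen and sort-based fallback are the parts still in flight):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("shj-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical inputs; any two DataFrames sharing a join key work.
    val users  = Seq((1, "alice"), (2, "bob")).toDF("user_id", "name")
    val orders = Seq((1, 9.99), (3, 5.0)).toDF("user_id", "price")

    // Ask the planner for a shuffled hash join explicitly; FULL OUTER support
    // for this join type is the SPARK-32461 line of work.
    val joined = users.hint("shuffle_hash").join(orders, Seq("user_id"), "full_outer")
    joined.explain()  // the plan should show ShuffledHashJoin rather than SortMergeJoin
    joined.show()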

Thanks,
Cheng Su

From: Dongjoon Hyun 
Date: Friday, February 26, 2021 at 4:06 PM
To: Hyukjin Kwon 
Cc: huaxin gao , Xiao Li , dev 

Subject: Re: Apache Spark 3.2 Expectation

Sure, thank you, Hyukjin.

Bests,
Dongjoon.


On Fri, Feb 26, 2021 at 4:01 PM Hyukjin Kwon  wrote:
I have an idea which I'll send an email to discuss next week or the week
after. I did not have enough bandwidth to drive both together at the same
time. I would appreciate it if we had some more time for 3.2.

In addition, it would also be great if we followed the schedule and caught
potential blockers quickly during QA instead of when we cut RCs. That would
considerably speed up the process and keep the release on time.

Thanks.

On Sat, 27 Feb 2021, 06:00 Dongjoon Hyun,  wrote:
Thank you for sharing your plan, Huaxin!

Bests,
Dongjoon.


On Fri, Feb 26, 2021 at 12:20 PM huaxin gao  wrote:
Thanks Dongjoon and Xiao for the discussion. I would like to add Data Source V2 
Aggregate push down to the list. I am currently working on JDBC Data Source V2 
Aggregate push down, but the common code can be used for the file based V2 Data 
Source as well. For example, MAX and MIN can be pushed down to Parquet and Orc, 
since they can use statistics information to perform these operations 
efficiently. Quite a few users are interested in this Aggregate push down 
feature and the preliminary performance test for JDBC Aggregate push down is 
positive. So I think it is a valuable feature to add for Spark 3.2.

Thanks,
Huaxin

On Fri, Feb 26, 2021 at 11:13 AM Xiao Li  wrote:
Thank you, Dongjoon, for initiating this discussion. Let us keep it open. It 
might take 1-2 weeks to collect from the community all the features we plan to 
build and ship in 3.2 since we just finished the 3.1 voting.

3. +100 for Apache Spark 3.2.0 in July 2021. Maybe, we need `branch-cut` in 
April because we took 3 months for the Spark 3.1 release.

TBH, cutting the branch this April does not look good to me. That means we
only have one month left for feature development of Spark 3.2. Do we have 
enough features in the current master branch? If not, are we able to finish 
major features we collected here? Do they have a timeline or project plan?

Xiao

Dongjoon Hyun  wrote on Fri, Feb 26, 2021 at 10:07 AM:
Thank you, Mridul and Sean.

1. Yes, `2017` was a typo. Java 17 is scheduled for September 2021. And, of course,
it's a nice-to-have status. :)

2. `Push based shuffle and disaggregated shuffle`. Definitely. Thanks for
sharing.

3. +100 for Apache Spark 3.2.0 in July 2021. Maybe, we need a `branch-cut` in
April because we took 3 months for the Spark 3.1 release.
Let's update our release roadmap of the Apache Spark website.

> I'd roughly expect 3.2 in, say, July of this year, given the usual cadence. 
> No reason it couldn't be a little sooner or later. There is already some good 
> stuff in 3.2 and will be a good minor release in 5-6 months.

Bests,
Dongjoon.



On Thu, Feb 25, 2021 at 9:33 AM Sean Owen  wrote:
I'd roughly expect 3.2 in, say, July of this year, given the usual cadence. No 
reason it couldn't be a little sooner or later. There is already some good
stuff in 3.2, and it will be a good minor release in 5-6 months.

On Thu, Feb 25, 2021 at 10:57 AM Dongjoon Hyun  wrote:
Hi, All.

Since we have been preparing Apache Spark 3.2.0 in master branch since December 
2020, March seems to be a good time to share our thoughts and aspirations on 
Apache Spark 3.2.

According to the progress on Apache Spark 3.1 release, Apache Spark 3.2 seems 
to be the last minor release of this year. Given the timeframe, we might 
consider the following. (This is a small set. Please add your thoughts to this 
limited list.)

# Languages

- Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but slipped out. 
Currently, we are trying to use Scala 2.13.5 via SPARK-34505 and investigating 
the publishing issue. Thank you for your contributions and feedback on this.

- Java 17 LTS Support: Java 17 LTS will arrive in September 2017. Like Java
11, we need lots of support from our dependencies. Let's see.

Re: Apache Spark 3.2 Expectation

2021-02-26 Thread Dongjoon Hyun
Sure, thank you, Hyukjin.

Bests,
Dongjoon.


On Fri, Feb 26, 2021 at 4:01 PM Hyukjin Kwon  wrote:

> I have an idea which I'll send an email to discuss next week or the week
> after. I did not have enough bandwidth to drive both together at the same
> time. I would appreciate it if we had some more time for 3.2.
>
> In addition, it would also be great if we followed the schedule and caught
> potential blockers quickly during QA instead of when we cut RCs. That would
> considerably speed up the process and keep the release on time.
>
> Thanks.
>
>
> On Sat, 27 Feb 2021, 06:00 Dongjoon Hyun,  wrote:
>
>> Thank you for sharing your plan, Huaxin!
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Fri, Feb 26, 2021 at 12:20 PM huaxin gao 
>> wrote:
>>
>>> Thanks Dongjoon and Xiao for the discussion. I would like to add Data
>>> Source V2 Aggregate push down to the list. I am currently working on
>>> JDBC Data Source V2 Aggregate push down, but the common code can be used
>>> for the file based V2 Data Source as well. For example, MAX and MIN can be
>>> pushed down to Parquet and Orc, since they can use statistics information
>>> to perform these operations efficiently. Quite a few users are
>>> interested in this Aggregate push down feature and the preliminary
>>> performance test for JDBC Aggregate push down is positive. So I think it is
>>> a valuable feature to add for Spark 3.2.
>>>
>>> Thanks,
>>> Huaxin
>>>
>>> On Fri, Feb 26, 2021 at 11:13 AM Xiao Li  wrote:
>>>
 Thank you, Dongjoon, for initiating this discussion. Let us keep it
 open. It might take 1-2 weeks to collect from the community all the
 features we plan to build and ship in 3.2 since we just finished the 3.1
 voting.


> 3. +100 for Apache Spark 3.2.0 in July 2021. Maybe, we need
> `branch-cut` in April because we took 3 months for the Spark 3.1 release.


 TBH, cutting the branch this April does not look good to me. That
 means, we only have one month left for feature development of Spark 3.2. Do
 we have enough features in the current master branch? If not, are we able
 to finish major features we collected here? Do they have a timeline or
 project plan?

 Xiao

 Dongjoon Hyun  wrote on Fri, Feb 26, 2021 at 10:07 AM:

> Thank you, Mridul and Sean.
>
> 1. Yes, `2017` was a typo. Java 17 is scheduled September 2021. And,
> of course, it's a nice-to-have status. :)
>
> 2. `Push based shuffle and disaggregated shuffle`. Definitely. Thanks
> for sharing,
>
> 3. +100 for Apache Spark 3.2.0 in July 2021. Maybe, we need
> `branch-cut` in April because we took 3 months for the Spark 3.1 release.
> Let's update our release roadmap of the Apache Spark website.
>
> > I'd roughly expect 3.2 in, say, July of this year, given the usual
> cadence. No reason it couldn't be a little sooner or later. There is
> already some good stuff in 3.2 and will be a good minor release in 5-6
> months.
>
> Bests,
> Dongjoon.
>
>
>
> On Thu, Feb 25, 2021 at 9:33 AM Sean Owen  wrote:
>
>> I'd roughly expect 3.2 in, say, July of this year, given the usual
>> cadence. No reason it couldn't be a little sooner or later. There is
>> already some good stuff in 3.2 and will be a good minor release in 5-6
>> months.
>>
>> On Thu, Feb 25, 2021 at 10:57 AM Dongjoon Hyun <
>> dongjoon.h...@gmail.com> wrote:
>>
>>> Hi, All.
>>>
>>> Since we have been preparing Apache Spark 3.2.0 in master branch
>>> since December 2020, March seems to be a good time to share our thoughts
>>> and aspirations on Apache Spark 3.2.
>>>
>>> According to the progress on Apache Spark 3.1 release, Apache Spark
>>> 3.2 seems to be the last minor release of this year. Given the 
>>> timeframe,
>>> we might consider the following. (This is a small set. Please add your
>>> thoughts to this limited list.)
>>>
>>> # Languages
>>>
>>> - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but
>>> slipped out. Currently, we are trying to use Scala 2.13.5 via 
>>> SPARK-34505
>>> and investigating the publishing issue. Thank you for your contributions
>>> and feedback on this.
>>>
>>> - Java 17 LTS Support: Java 17 LTS will arrive in September 2017.
>>> Like Java 11, we need lots of support from our dependencies. Let's see.
>>>
>>> - Python 3.6 Deprecation(?): Python 3.6 community support ends on
>>> 2021-12-23. So, the deprecation is not required yet, but we had better
>>> prepare for it because we don't have an ETA for Apache Spark 3.3 in 2022.
>>>
>>> - SparkR CRAN publishing: As we know, it's discontinued so far.
>>> Resuming it depends on the success of Apache SparkR 3.1.1 CRAN publishing.
>>> If that succeeds, we can keep publishing. Otherwise, I believe
>>> we had better drop it from the releasing work item list officially.

Re: Apache Spark 3.2 Expectation

2021-02-26 Thread Hyukjin Kwon
I have an idea which I'll send an email to discuss next week or the week
after. I did not have enough bandwidth to drive both together at the same
time. I would appreciate it if we had some more time for 3.2.

In addition, it would also be great if we followed the schedule and caught
potential blockers quickly during QA instead of when we cut RCs. That would
considerably speed up the process and keep the release on time.

Thanks.


On Sat, 27 Feb 2021, 06:00 Dongjoon Hyun,  wrote:

> Thank you for sharing your plan, Huaxin!
>
> Bests,
> Dongjoon.
>
>
> On Fri, Feb 26, 2021 at 12:20 PM huaxin gao 
> wrote:
>
>> Thanks Dongjoon and Xiao for the discussion. I would like to add Data
>> Source V2 Aggregate push down to the list. I am currently working on
>> JDBC Data Source V2 Aggregate push down, but the common code can be used
>> for the file based V2 Data Source as well. For example, MAX and MIN can be
>> pushed down to Parquet and Orc, since they can use statistics information
>> to perform these operations efficiently. Quite a few users are
>> interested in this Aggregate push down feature and the preliminary
>> performance test for JDBC Aggregate push down is positive. So I think it is
>> a valuable feature to add for Spark 3.2.
>>
>> Thanks,
>> Huaxin
>>
>> On Fri, Feb 26, 2021 at 11:13 AM Xiao Li  wrote:
>>
>>> Thank you, Dongjoon, for initiating this discussion. Let us keep it
>>> open. It might take 1-2 weeks to collect from the community all the
>>> features we plan to build and ship in 3.2 since we just finished the 3.1
>>> voting.
>>>
>>>
 3. +100 for Apache Spark 3.2.0 in July 2021. Maybe, we need
 `branch-cut` in April because we took 3 months for the Spark 3.1 release.
>>>
>>>
>>> TBH, cutting the branch this April does not look good to me. That means,
>>> we only have one month left for feature development of Spark 3.2. Do we
>>> have enough features in the current master branch? If not, are we able to
>>> finish major features we collected here? Do they have a timeline or project
>>> plan?
>>>
>>> Xiao
>>>
 Dongjoon Hyun  wrote on Fri, Feb 26, 2021 at 10:07 AM:
>>>
 Thank you, Mridul and Sean.

 1. Yes, `2017` was a typo. Java 17 is scheduled September 2021. And, of
 course, it's a nice-to-have status. :)

 2. `Push based shuffle and disaggregated shuffle`. Definitely. Thanks
 for sharing,

 3. +100 for Apache Spark 3.2.0 in July 2021. Maybe, we need
 `branch-cut` in April because we took 3 months for the Spark 3.1 release.
 Let's update our release roadmap of the Apache Spark website.

 > I'd roughly expect 3.2 in, say, July of this year, given the usual
 cadence. No reason it couldn't be a little sooner or later. There is
 already some good stuff in 3.2 and will be a good minor release in 5-6
 months.

 Bests,
 Dongjoon.



 On Thu, Feb 25, 2021 at 9:33 AM Sean Owen  wrote:

> I'd roughly expect 3.2 in, say, July of this year, given the usual
> cadence. No reason it couldn't be a little sooner or later. There is
> already some good stuff in 3.2 and will be a good minor release in 5-6
> months.
>
> On Thu, Feb 25, 2021 at 10:57 AM Dongjoon Hyun <
> dongjoon.h...@gmail.com> wrote:
>
>> Hi, All.
>>
>> Since we have been preparing Apache Spark 3.2.0 in master branch
>> since December 2020, March seems to be a good time to share our thoughts
>> and aspirations on Apache Spark 3.2.
>>
>> According to the progress on Apache Spark 3.1 release, Apache Spark
>> 3.2 seems to be the last minor release of this year. Given the timeframe,
>> we might consider the following. (This is a small set. Please add your
>> thoughts to this limited list.)
>>
>> # Languages
>>
>> - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but
>> slipped out. Currently, we are trying to use Scala 2.13.5 via SPARK-34505
>> and investigating the publishing issue. Thank you for your contributions
>> and feedback on this.
>>
>> - Java 17 LTS Support: Java 17 LTS will arrive in September 2017.
>> Like Java 11, we need lots of support from our dependencies. Let's see.
>>
>> - Python 3.6 Deprecation(?): Python 3.6 community support ends on
>> 2021-12-23. So, the deprecation is not required yet, but we had better
>> prepare for it because we don't have an ETA for Apache Spark 3.3 in 2022.
>>
>> - SparkR CRAN publishing: As we know, it's discontinued so far.
>> Resuming it depends on the success of Apache SparkR 3.1.1 CRAN publishing.
>> If that succeeds, we can keep publishing. Otherwise, I believe
>> we had better drop it from the releasing work item list officially.
>>
>> # Dependencies
>>
>> - Apache Hadoop 3.3.2: Hadoop 3.2.0 becomes the default Hadoop
>> profile in Apache Spark 3.1. Currently, Spark master branch lives on Hadoop
>> 3.2.2's shaded clients via SPARK-33212.

Re: Apache Spark 3.2 Expectation

2021-02-26 Thread Dongjoon Hyun
Thank you for sharing your plan, Huaxin!

Bests,
Dongjoon.


On Fri, Feb 26, 2021 at 12:20 PM huaxin gao  wrote:

> Thanks Dongjoon and Xiao for the discussion. I would like to add Data
> Source V2 Aggregate push down to the list. I am currently working on
> JDBC Data Source V2 Aggregate push down, but the common code can be used
> for the file based V2 Data Source as well. For example, MAX and MIN can be
> pushed down to Parquet and Orc, since they can use statistics information
> to perform these operations efficiently. Quite a few users are
> interested in this Aggregate push down feature and the preliminary
> performance test for JDBC Aggregate push down is positive. So I think it is
> a valuable feature to add for Spark 3.2.
>
> Thanks,
> Huaxin
>
> On Fri, Feb 26, 2021 at 11:13 AM Xiao Li  wrote:
>
>> Thank you, Dongjoon, for initiating this discussion. Let us keep it open.
>> It might take 1-2 weeks to collect from the community all the features
>> we plan to build and ship in 3.2 since we just finished the 3.1 voting.
>>
>>
>>> 3. +100 for Apache Spark 3.2.0 in July 2021. Maybe, we need `branch-cut`
>>> in April because we took 3 months for the Spark 3.1 release.
>>
>>
>> TBH, cutting the branch this April does not look good to me. That means,
>> we only have one month left for feature development of Spark 3.2. Do we
>> have enough features in the current master branch? If not, are we able to
>> finish major features we collected here? Do they have a timeline or project
>> plan?
>>
>> Xiao
>>
>> Dongjoon Hyun  wrote on Fri, Feb 26, 2021 at 10:07 AM:
>>
>>> Thank you, Mridul and Sean.
>>>
>>> 1. Yes, `2017` was a typo. Java 17 is scheduled September 2021. And, of
>>> course, it's a nice-to-have status. :)
>>>
>>> 2. `Push based shuffle and disaggregated shuffle`. Definitely. Thanks
>>> for sharing,
>>>
>>> 3. +100 for Apache Spark 3.2.0 in July 2021. Maybe, we need `branch-cut`
>>> in April because we took 3 months for the Spark 3.1 release.
>>> Let's update our release roadmap of the Apache Spark website.
>>>
>>> > I'd roughly expect 3.2 in, say, July of this year, given the usual
>>> cadence. No reason it couldn't be a little sooner or later. There is
>>> already some good stuff in 3.2 and will be a good minor release in 5-6
>>> months.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>>
>>> On Thu, Feb 25, 2021 at 9:33 AM Sean Owen  wrote:
>>>
 I'd roughly expect 3.2 in, say, July of this year, given the usual
 cadence. No reason it couldn't be a little sooner or later. There is
 already some good stuff in 3.2 and will be a good minor release in 5-6
 months.

 On Thu, Feb 25, 2021 at 10:57 AM Dongjoon Hyun 
 wrote:

> Hi, All.
>
> Since we have been preparing Apache Spark 3.2.0 in master branch since
> December 2020, March seems to be a good time to share our thoughts and
> aspirations on Apache Spark 3.2.
>
> According to the progress on Apache Spark 3.1 release, Apache Spark
> 3.2 seems to be the last minor release of this year. Given the timeframe,
> we might consider the following. (This is a small set. Please add your
> thoughts to this limited list.)
>
> # Languages
>
> - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but
> slipped out. Currently, we are trying to use Scala 2.13.5 via SPARK-34505
> and investigating the publishing issue. Thank you for your contributions
> and feedback on this.
>
> - Java 17 LTS Support: Java 17 LTS will arrive in September 2017. Like
> Java 11, we need lots of support from our dependencies. Let's see.
>
> - Python 3.6 Deprecation(?): Python 3.6 community support ends on
> 2021-12-23. So, the deprecation is not required yet, but we had better
> prepare for it because we don't have an ETA for Apache Spark 3.3 in 2022.
>
> - SparkR CRAN publishing: As we know, it's discontinued so far.
> Resuming it depends on the success of Apache SparkR 3.1.1 CRAN publishing.
> If that succeeds, we can keep publishing. Otherwise, I believe
> we had better drop it from the releasing work item list officially.
>
> # Dependencies
>
> - Apache Hadoop 3.3.2: Hadoop 3.2.0 becomes the default Hadoop profile
> in Apache Spark 3.1. Currently, Spark master branch lives on Hadoop 
> 3.2.2's
> shaded clients via SPARK-33212. So far, there is one on-going report at
> YARN environment. We hope it will be fixed soon at Spark 3.2 timeframe and
> we can move toward Hadoop 3.3.2.
>
> - Apache Hive 2.3.9: Spark 3.0 starts to use Hive 2.3.7 by default
> instead of old Hive 1.2 fork. Spark 3.1 removed hive-1.2 profile 
> completely
> via SPARK-32981 and replaced the generated hive-service-rpc code with the
> official dependency via SPARK-32981. We are steadily improving this area
> and will consume Hive 2.3.9 if available.
>
> - K8s Client 4.13.2: During K8s GA activity, Spark 3.1 upgrades K8s client
> dependency to 4.12.0.

Re: Apache Spark 3.2 Expectation

2021-02-26 Thread Dongjoon Hyun
On Fri, Feb 26, 2021 at 11:13 AM Xiao Li  wrote:

> Do we have enough features in the current master branch?
>

Hi, Xiao.
Is this a question about Sean's previous comment, `There is already some good
stuff in 3.2 and will be a good minor release in 5-6 months.`?


On Thu, Feb 25, 2021 at 9:33 AM Sean Owen  wrote:
>>
>>> I'd roughly expect 3.2 in, say, July of this year, given the usual
>>> cadence. No reason it couldn't be a little sooner or later. There is
>>> already some good stuff in 3.2 and will be a good minor release in 5-6
>>> months.
>>>
>>


Re: Apache Spark 3.2 Expectation

2021-02-26 Thread huaxin gao
Thanks, Dongjoon and Xiao, for the discussion. I would like to add Data
Source V2 Aggregate push down to the list. I am currently working on
JDBC Data Source V2 Aggregate push down, but the common code can be used
for file-based V2 Data Sources as well. For example, MAX and MIN can be
pushed down to Parquet and ORC, since those formats can use statistics
information to perform these operations efficiently. Quite a few users are
interested in this Aggregate push down feature, and the preliminary
performance test for JDBC Aggregate push down is positive. So I think it is
a valuable feature to add for Spark 3.2.
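
As a purely hypothetical sketch of the intended usage (the `pushDownAggregate`
option name and the JDBC connection details below are assumptions about the
in-progress work, not a released API):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{max, min}

    val spark = SparkSession.builder().appName("agg-pushdown-demo").getOrCreate()
    val df = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://db.example.com:5432/shop")  // assumed
      .option("dbtable", "orders")                                  // assumed
      .option("pushDownAggregate", "true")                          // assumed flag
      .load()

    // With push down, MAX/MIN would be evaluated by the database (or, for
    // Parquet/ORC, answered from file statistics) instead of scanning rows
    // into Spark.
    df.agg(max("price"), min("price")).show()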

Thanks,
Huaxin

On Fri, Feb 26, 2021 at 11:13 AM Xiao Li  wrote:

> Thank you, Dongjoon, for initiating this discussion. Let us keep it open.
> It might take 1-2 weeks to collect from the community all the features
> we plan to build and ship in 3.2 since we just finished the 3.1 voting.
>
>
>> 3. +100 for Apache Spark 3.2.0 in July 2021. Maybe, we need `branch-cut`
>> in April because we took 3 months for the Spark 3.1 release.
>
>
> TBH, cutting the branch this April does not look good to me. That means,
> we only have one month left for feature development of Spark 3.2. Do we
> have enough features in the current master branch? If not, are we able to
> finish major features we collected here? Do they have a timeline or project
> plan?
>
> Xiao
>
> Dongjoon Hyun  wrote on Fri, Feb 26, 2021 at 10:07 AM:
>
>> Thank you, Mridul and Sean.
>>
>> 1. Yes, `2017` was a typo. Java 17 is scheduled September 2021. And, of
>> course, it's a nice-to-have status. :)
>>
>> 2. `Push based shuffle and disaggregated shuffle`. Definitely. Thanks for
>> sharing,
>>
>> 3. +100 for Apache Spark 3.2.0 in July 2021. Maybe, we need `branch-cut`
>> in April because we took 3 months for the Spark 3.1 release.
>> Let's update our release roadmap of the Apache Spark website.
>>
>> > I'd roughly expect 3.2 in, say, July of this year, given the usual
>> cadence. No reason it couldn't be a little sooner or later. There is
>> already some good stuff in 3.2 and will be a good minor release in 5-6
>> months.
>>
>> Bests,
>> Dongjoon.
>>
>>
>>
>> On Thu, Feb 25, 2021 at 9:33 AM Sean Owen  wrote:
>>
>>> I'd roughly expect 3.2 in, say, July of this year, given the usual
>>> cadence. No reason it couldn't be a little sooner or later. There is
>>> already some good stuff in 3.2 and will be a good minor release in 5-6
>>> months.
>>>
>>> On Thu, Feb 25, 2021 at 10:57 AM Dongjoon Hyun 
>>> wrote:
>>>
 Hi, All.

 Since we have been preparing Apache Spark 3.2.0 in master branch since
 December 2020, March seems to be a good time to share our thoughts and
 aspirations on Apache Spark 3.2.

 According to the progress on Apache Spark 3.1 release, Apache Spark 3.2
 seems to be the last minor release of this year. Given the timeframe, we
 might consider the following. (This is a small set. Please add your
 thoughts to this limited list.)

 # Languages

 - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but
 slipped out. Currently, we are trying to use Scala 2.13.5 via SPARK-34505
 and investigating the publishing issue. Thank you for your contributions
 and feedback on this.

 - Java 17 LTS Support: Java 17 LTS will arrive in September 2017. Like
 Java 11, we need lots of support from our dependencies. Let's see.

 - Python 3.6 Deprecation(?): Python 3.6 community support ends on
 2021-12-23. So, the deprecation is not required yet, but we had better
 prepare for it because we don't have an ETA for Apache Spark 3.3 in 2022.

 - SparkR CRAN publishing: As we know, it's discontinued so far.
 Resuming it depends on the success of Apache SparkR 3.1.1 CRAN publishing.
 If that succeeds, we can keep publishing. Otherwise, I believe
 we had better drop it from the releasing work item list officially.

 # Dependencies

 - Apache Hadoop 3.3.2: Hadoop 3.2.0 becomes the default Hadoop profile
 in Apache Spark 3.1. Currently, Spark master branch lives on Hadoop 3.2.2's
 shaded clients via SPARK-33212. So far, there is one on-going report at
 YARN environment. We hope it will be fixed soon at Spark 3.2 timeframe and
 we can move toward Hadoop 3.3.2.

 - Apache Hive 2.3.9: Spark 3.0 starts to use Hive 2.3.7 by default
 instead of old Hive 1.2 fork. Spark 3.1 removed hive-1.2 profile completely
 via SPARK-32981 and replaced the generated hive-service-rpc code with the
 official dependency via SPARK-32981. We are steadily improving this area
 and will consume Hive 2.3.9 if available.

 - K8s Client 4.13.2: During K8s GA activity, Spark 3.1 upgrades K8s
 client dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in order to
 support K8s model 1.19.

 - Kafka Client 2.8: To bring the client fixes, Spark 3.1 is using Kafka
 Client 2.6. For Spark 3.2, SPARK-33913 upgraded to Kafka 2.7 with Scala
 2.12.13, but it was reverted later due to a Scala 2.12.13 issue.

Re: Apache Spark 3.2 Expectation

2021-02-26 Thread Xiao Li
Thank you, Dongjoon, for initiating this discussion. Let us keep it open.
It might take 1-2 weeks to collect from the community all the features
we plan to build and ship in 3.2 since we just finished the 3.1 voting.


> 3. +100 for Apache Spark 3.2.0 in July 2021. Maybe, we need `branch-cut`
> in April because we took 3 months for the Spark 3.1 release.


TBH, cutting the branch this April does not look good to me. That means we
only have one month left for feature development of Spark 3.2. Do we have
enough features in the current master branch? If not, are we able to finish
major features we collected here? Do they have a timeline or project plan?

Xiao

Dongjoon Hyun  wrote on Fri, Feb 26, 2021 at 10:07 AM:

> Thank you, Mridul and Sean.
>
> 1. Yes, `2017` was a typo. Java 17 is scheduled September 2021. And, of
> course, it's a nice-to-have status. :)
>
> 2. `Push based shuffle and disaggregated shuffle`. Definitely. Thanks for
> sharing,
>
> 3. +100 for Apache Spark 3.2.0 in July 2021. Maybe, we need `branch-cut`
> in April because we took 3 months for the Spark 3.1 release.
> Let's update our release roadmap of the Apache Spark website.
>
> > I'd roughly expect 3.2 in, say, July of this year, given the usual
> cadence. No reason it couldn't be a little sooner or later. There is
> already some good stuff in 3.2 and will be a good minor release in 5-6
> months.
>
> Bests,
> Dongjoon.
>
>
>
> On Thu, Feb 25, 2021 at 9:33 AM Sean Owen  wrote:
>
>> I'd roughly expect 3.2 in, say, July of this year, given the usual
>> cadence. No reason it couldn't be a little sooner or later. There is
>> already some good stuff in 3.2 and will be a good minor release in 5-6
>> months.
>>
>> On Thu, Feb 25, 2021 at 10:57 AM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, All.
>>>
>>> Since we have been preparing Apache Spark 3.2.0 in master branch since
>>> December 2020, March seems to be a good time to share our thoughts and
>>> aspirations on Apache Spark 3.2.
>>>
>>> According to the progress on Apache Spark 3.1 release, Apache Spark 3.2
>>> seems to be the last minor release of this year. Given the timeframe, we
>>> might consider the following. (This is a small set. Please add your
>>> thoughts to this limited list.)
>>>
>>> # Languages
>>>
>>> - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but
>>> slipped out. Currently, we are trying to use Scala 2.13.5 via SPARK-34505
>>> and investigating the publishing issue. Thank you for your contributions
>>> and feedback on this.
>>>
>>> - Java 17 LTS Support: Java 17 LTS will arrive in September 2017. Like
>>> Java 11, we need lots of support from our dependencies. Let's see.
>>>
>>> - Python 3.6 Deprecation(?): Python 3.6 community support ends on
>>> 2021-12-23. So, the deprecation is not required yet, but we had better
>>> prepare for it because we don't have an ETA for Apache Spark 3.3 in 2022.
>>>
>>> - SparkR CRAN publishing: As we know, it's discontinued so far. Resuming
>>> it depends on the success of Apache SparkR 3.1.1 CRAN publishing. If that
>>> succeeds, we can keep publishing. Otherwise, I believe we had
>>> better drop it from the releasing work item list officially.
>>>
>>> # Dependencies
>>>
>>> - Apache Hadoop 3.3.2: Hadoop 3.2.0 becomes the default Hadoop profile
>>> in Apache Spark 3.1. Currently, Spark master branch lives on Hadoop 3.2.2's
>>> shaded clients via SPARK-33212. So far, there is one on-going report at
>>> YARN environment. We hope it will be fixed soon at Spark 3.2 timeframe and
>>> we can move toward Hadoop 3.3.2.
>>>
>>> - Apache Hive 2.3.9: Spark 3.0 starts to use Hive 2.3.7 by default
>>> instead of old Hive 1.2 fork. Spark 3.1 removed hive-1.2 profile completely
>>> via SPARK-32981 and replaced the generated hive-service-rpc code with the
>>> official dependency via SPARK-32981. We are steadily improving this area
>>> and will consume Hive 2.3.9 if available.
>>>
>>> - K8s Client 4.13.2: During K8s GA activity, Spark 3.1 upgrades K8s
>>> client dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in order to
>>> support K8s model 1.19.
>>>
>>> - Kafka Client 2.8: To bring the client fixes, Spark 3.1 is using Kafka
>>> Client 2.6. For Spark 3.2, SPARK-33913 upgraded to Kafka 2.7 with Scala
>>> 2.12.13, but it was reverted later due to Scala 2.12.13 issue. Since
>>> KAFKA-12357 fixed the Scala requirement two days ago, Spark 3.2 will go
>>> with Kafka Client 2.8 hopefully.
>>>
>>> # Some Features
>>>
>>> - Data Source v2: Spark 3.2 will deliver much richer DSv2 with Apache
>>> Iceberg integration. Especially, we hope the on-going function catalog SPIP
>>> and up-coming storage partitioned join SPIP can be delivered as a part of
>>> Spark 3.2 and become an additional foundation.
>>>
>>> - Columnar Encryption: As of today, Apache Spark master branch supports
>>> columnar encryption via Apache ORC 1.6 and it's documented via SPARK-34036.
>>> Also, upcoming Apache Parquet 1.12 has a similar capability. Hopefully,
>>> Apache Spark 3.2 is going to be the first release to have this feature
>>> officially.

Re: Apache Spark 3.2 Expectation

2021-02-26 Thread Dongjoon Hyun
Thank you, Mridul and Sean.

1. Yes, `2017` was a typo. Java 17 is scheduled for September 2021. And, of
course, it's a nice-to-have status. :)

2. `Push based shuffle and disaggregated shuffle`. Definitely. Thanks for
sharing.

3. +100 for Apache Spark 3.2.0 in July 2021. Maybe, we need a `branch-cut` in
April because we took 3 months for the Spark 3.1 release.
Let's update our release roadmap of the Apache Spark website.

> I'd roughly expect 3.2 in, say, July of this year, given the usual
cadence. No reason it couldn't be a little sooner or later. There is
already some good stuff in 3.2 and will be a good minor release in 5-6
months.

Bests,
Dongjoon.



On Thu, Feb 25, 2021 at 9:33 AM Sean Owen  wrote:

> I'd roughly expect 3.2 in, say, July of this year, given the usual
> cadence. No reason it couldn't be a little sooner or later. There is
> already some good stuff in 3.2 and will be a good minor release in 5-6
> months.
>
> On Thu, Feb 25, 2021 at 10:57 AM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> Since we have been preparing Apache Spark 3.2.0 in master branch since
>> December 2020, March seems to be a good time to share our thoughts and
>> aspirations on Apache Spark 3.2.
>>
>> According to the progress on Apache Spark 3.1 release, Apache Spark 3.2
>> seems to be the last minor release of this year. Given the timeframe, we
>> might consider the following. (This is a small set. Please add your
>> thoughts to this limited list.)
>>
>> # Languages
>>
>> - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but
>> slipped out. Currently, we are trying to use Scala 2.13.5 via SPARK-34505
>> and investigating the publishing issue. Thank you for your contributions
>> and feedback on this.
>>
>> - Java 17 LTS Support: Java 17 LTS will arrive in September 2017. Like
>> Java 11, we need lots of support from our dependencies. Let's see.
>>
>> - Python 3.6 Deprecation(?): Python 3.6 community support ends on
>> 2021-12-23. So, the deprecation is not required yet, but we had better
>> prepare for it because we don't have an ETA for Apache Spark 3.3 in 2022.
>>
>> - SparkR CRAN publishing: As we know, it's discontinued so far. Resuming
>> it depends on the success of Apache SparkR 3.1.1 CRAN publishing. If that
>> succeeds, we can keep publishing. Otherwise, I believe we had
>> better drop it from the releasing work item list officially.
>>
>> # Dependencies
>>
>> - Apache Hadoop 3.3.2: Hadoop 3.2.0 becomes the default Hadoop profile in
>> Apache Spark 3.1. Currently, Spark master branch lives on Hadoop 3.2.2's
>> shaded clients via SPARK-33212. So far, there is one on-going report at
>> YARN environment. We hope it will be fixed soon at Spark 3.2 timeframe and
>> we can move toward Hadoop 3.3.2.
>>
>> - Apache Hive 2.3.9: Spark 3.0 starts to use Hive 2.3.7 by default
>> instead of old Hive 1.2 fork. Spark 3.1 removed hive-1.2 profile completely
>> via SPARK-32981 and replaced the generated hive-service-rpc code with the
>> official dependency via SPARK-32981. We are steadily improving this area
>> and will consume Hive 2.3.9 if available.
>>
>> - K8s Client 4.13.2: During K8s GA activity, Spark 3.1 upgrades K8s
>> client dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in order to
>> support K8s model 1.19.
>>
>> - Kafka Client 2.8: To bring the client fixes, Spark 3.1 is using Kafka
>> Client 2.6. For Spark 3.2, SPARK-33913 upgraded to Kafka 2.7 with Scala
>> 2.12.13, but it was reverted later due to Scala 2.12.13 issue. Since
>> KAFKA-12357 fixed the Scala requirement two days ago, Spark 3.2 will go
>> with Kafka Client 2.8 hopefully.
>>
>> # Some Features
>>
>> - Data Source v2: Spark 3.2 will deliver much richer DSv2 with Apache
>> Iceberg integration. Especially, we hope the on-going function catalog SPIP
>> and up-coming storage partitioned join SPIP can be delivered as a part of
>> Spark 3.2 and become an additional foundation.
>>
>> - Columnar Encryption: As of today, Apache Spark master branch supports
>> columnar encryption via Apache ORC 1.6 and it's documented via SPARK-34036.
>> Also, upcoming Apache Parquet 1.12 has a similar capability. Hopefully,
>> Apache Spark 3.2 is going to be the first release to have this feature
>> officially. Any feedback is welcome.
>>
>> - Improved ZStandard Support: Spark 3.2 will bring more benefits for
>> ZStandard users: 1) SPARK-34340 added native ZSTD JNI buffer pool support
>> for all IO operations, 2) SPARK-33978 makes ORC datasource support ZSTD
>> compression, 3) SPARK-34503 sets ZSTD as the default codec for event log
>> compression, 4) SPARK-34479 aims to support ZSTD at Avro data source. Also,
>> the upcoming Parquet 1.12 supports ZSTD (and supports JNI buffer pool),
>> too. I'm expecting more benefits.
>>
>> - Structured Streaming with RocksDB backend: According to the latest
>> update, it looks active enough for merging to master branch in Spark 3.2.
>>
>> Please share your thoughts and let's build better Apache Spark 3.2
>> together.
>>

Re: Apache Spark 3.2 Expectation

2021-02-25 Thread Sean Owen
I'd roughly expect 3.2 in, say, July of this year, given the usual cadence.
No reason it couldn't be a little sooner or later. There is already some
good stuff in 3.2, and it will be a good minor release in 5-6 months.

On Thu, Feb 25, 2021 at 10:57 AM Dongjoon Hyun 
wrote:

> Hi, All.
>
> Since we have been preparing Apache Spark 3.2.0 in master branch since
> December 2020, March seems to be a good time to share our thoughts and
> aspirations on Apache Spark 3.2.
>
> According to the progress on Apache Spark 3.1 release, Apache Spark 3.2
> seems to be the last minor release of this year. Given the timeframe, we
> might consider the following. (This is a small set. Please add your
> thoughts to this limited list.)
>
> # Languages
>
> - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but slipped
> out. Currently, we are trying to use Scala 2.13.5 via SPARK-34505 and
> investigating the publishing issue. Thank you for your contributions and
> feedback on this.
>
> - Java 17 LTS Support: Java 17 LTS will arrive in September 2017. Like
> Java 11, we need lots of support from our dependencies. Let's see.
>
> - Python 3.6 Deprecation(?): Python 3.6 community support ends on
> 2021-12-23. So, the deprecation is not required yet, but we had better
> prepare for it because we don't have an ETA for Apache Spark 3.3 in 2022.
>
> - SparkR CRAN publishing: As we know, it's discontinued so far. Resuming
> it depends on the success of Apache SparkR 3.1.1 CRAN publishing. If that
> succeeds, we can keep publishing. Otherwise, I believe we had
> better drop it from the releasing work item list officially.
>
> # Dependencies
>
> - Apache Hadoop 3.3.2: Hadoop 3.2.0 becomes the default Hadoop profile in
> Apache Spark 3.1. Currently, Spark master branch lives on Hadoop 3.2.2's
> shaded clients via SPARK-33212. So far, there is one on-going report at
> YARN environment. We hope it will be fixed soon at Spark 3.2 timeframe and
> we can move toward Hadoop 3.3.2.
>
> - Apache Hive 2.3.9: Spark 3.0 starts to use Hive 2.3.7 by default instead
> of old Hive 1.2 fork. Spark 3.1 removed hive-1.2 profile completely via
> SPARK-32981 and replaced the generated hive-service-rpc code with the
> official dependency via SPARK-32981. We are steadily improving this area
> and will consume Hive 2.3.9 if available.
>
> - K8s Client 4.13.2: During K8s GA activity, Spark 3.1 upgrades K8s client
> dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in order to support
> K8s model 1.19.
>
> - Kafka Client 2.8: To bring the client fixes, Spark 3.1 is using Kafka
> Client 2.6. For Spark 3.2, SPARK-33913 upgraded to Kafka 2.7 with Scala
> 2.12.13, but it was reverted later due to Scala 2.12.13 issue. Since
> KAFKA-12357 fixed the Scala requirement two days ago, Spark 3.2 will go
> with Kafka Client 2.8 hopefully.
>
> # Some Features
>
> - Data Source v2: Spark 3.2 will deliver much richer DSv2 with Apache
> Iceberg integration. Especially, we hope the on-going function catalog SPIP
> and up-coming storage partitioned join SPIP can be delivered as a part of
> Spark 3.2 and become an additional foundation.
>
> - Columnar Encryption: As of today, Apache Spark master branch supports
> columnar encryption via Apache ORC 1.6 and it's documented via SPARK-34036.
> Also, upcoming Apache Parquet 1.12 has a similar capability. Hopefully,
> Apache Spark 3.2 is going to be the first release to have this feature
> officially. Any feedback is welcome.
>
> - Improved ZStandard Support: Spark 3.2 will bring more benefits for
> ZStandard users: 1) SPARK-34340 added native ZSTD JNI buffer pool support
> for all IO operations, 2) SPARK-33978 makes ORC datasource support ZSTD
> compression, 3) SPARK-34503 sets ZSTD as the default codec for event log
> compression, 4) SPARK-34479 aims to support ZSTD at Avro data source. Also,
> the upcoming Parquet 1.12 supports ZSTD (and supports JNI buffer pool),
> too. I'm expecting more benefits.
>
> - Structured Streaming with RocksDB backend: According to the latest
> update, it looks active enough for merging to master branch in Spark 3.2.
>
> Please share your thoughts and let's build better Apache Spark 3.2
> together.
>
> Bests,
> Dongjoon.
>


Re: Apache Spark 3.2 Expectation

2021-02-25 Thread Mridul Muralidharan
Nit: Java 17 -> should be available by Sept 2021 :-)
Adoption would also depend on some of our nontrivial dependencies
supporting it - it might be a stretch to get it in for Apache Spark 3.2?

Features:
Push based shuffle and disaggregated shuffle should also be in 3.2
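
(As a sketch only: SPARK-30602 is not merged yet, so the property names below
are assumptions based on the ongoing patches, not a released configuration.)

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("push-shuffle-demo")
      .config("spark.shuffle.service.enabled", "true")  // external shuffle service is required
      .config("spark.shuffle.push.enabled", "true")     // assumed flag for push-based shuffle
      .getOrCreate()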


Regards,
Mridul






On Thu, Feb 25, 2021 at 10:57 AM Dongjoon Hyun 
wrote:

> Hi, All.
>
> Since we have been preparing Apache Spark 3.2.0 in master branch since
> December 2020, March seems to be a good time to share our thoughts and
> aspirations on Apache Spark 3.2.
>
> According to the progress on Apache Spark 3.1 release, Apache Spark 3.2
> seems to be the last minor release of this year. Given the timeframe, we
> might consider the following. (This is a small set. Please add your
> thoughts to this limited list.)
>
> # Languages
>
> - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but slipped
> out. Currently, we are trying to use Scala 2.13.5 via SPARK-34505 and
> investigating the publishing issue. Thank you for your contributions and
> feedback on this.
>
> - Java 17 LTS Support: Java 17 LTS will arrive in September 2017. Like
> Java 11, we need lots of support from our dependencies. Let's see.
>
> - Python 3.6 Deprecation(?): Python 3.6 community support ends on
> 2021-12-23. So, the deprecation is not required yet, but we had better
> prepare for it because we don't have an ETA for Apache Spark 3.3 in 2022.
>
> - SparkR CRAN publishing: As we know, it's discontinued so far. Resuming
> it depends on the success of Apache SparkR 3.1.1 CRAN publishing. If that
> succeeds, we can keep publishing. Otherwise, I believe we had
> better drop it from the releasing work item list officially.
>
> # Dependencies
>
> - Apache Hadoop 3.3.2: Hadoop 3.2.0 becomes the default Hadoop profile in
> Apache Spark 3.1. Currently, Spark master branch lives on Hadoop 3.2.2's
> shaded clients via SPARK-33212. So far, there is one on-going report at
> YARN environment. We hope it will be fixed soon at Spark 3.2 timeframe and
> we can move toward Hadoop 3.3.2.
>
> - Apache Hive 2.3.9: Spark 3.0 starts to use Hive 2.3.7 by default instead
> of old Hive 1.2 fork. Spark 3.1 removed hive-1.2 profile completely via
> SPARK-32981 and replaced the generated hive-service-rpc code with the
> official dependency via SPARK-32981. We are steadily improving this area
> and will consume Hive 2.3.9 if available.
>
> - K8s Client 4.13.2: During K8s GA activity, Spark 3.1 upgrades K8s client
> dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in order to support
> K8s model 1.19.
>
> - Kafka Client 2.8: To bring the client fixes, Spark 3.1 is using Kafka
> Client 2.6. For Spark 3.2, SPARK-33913 upgraded to Kafka 2.7 with Scala
> 2.12.13, but it was reverted later due to Scala 2.12.13 issue. Since
> KAFKA-12357 fixed the Scala requirement two days ago, Spark 3.2 will go
> with Kafka Client 2.8 hopefully.
>
> # Some Features
>
> - Data Source v2: Spark 3.2 will deliver much richer DSv2 with Apache
> Iceberg integration. Especially, we hope the on-going function catalog SPIP
> and up-coming storage partitioned join SPIP can be delivered as a part of
> Spark 3.2 and become an additional foundation.
>
> - Columnar Encryption: As of today, Apache Spark master branch supports
> columnar encryption via Apache ORC 1.6 and it's documented via SPARK-34036.
> Also, upcoming Apache Parquet 1.12 has a similar capability. Hopefully,
> Apache Spark 3.2 is going to be the first release to have this feature
> officially. Any feedback is welcome.
>
> - Improved ZStandard Support: Spark 3.2 will bring more benefits for
> ZStandard users: 1) SPARK-34340 added native ZSTD JNI buffer pool support
> for all IO operations, 2) SPARK-33978 makes ORC datasource support ZSTD
> compression, 3) SPARK-34503 sets ZSTD as the default codec for event log
> compression, 4) SPARK-34479 aims to support ZSTD at Avro data source. Also,
> the upcoming Parquet 1.12 supports ZSTD (and supports JNI buffer pool),
> too. I'm expecting more benefits.
>
> - Structured Streaming with RocksDB backend: According to the latest
> update, it looks active enough for merging to master branch in Spark 3.2.
>
> Please share your thoughts and let's build better Apache Spark 3.2
> together.
>
> Bests,
> Dongjoon.
>


Apache Spark 3.2 Expectation

2021-02-25 Thread Dongjoon Hyun
Hi, All.

As we have been preparing Apache Spark 3.2.0 in the master branch since
December 2020, March seems to be a good time to share our thoughts and
aspirations for Apache Spark 3.2.

According to the progress on Apache Spark 3.1 release, Apache Spark 3.2
seems to be the last minor release of this year. Given the timeframe, we
might consider the following. (This is a small set. Please add your
thoughts to this limited list.)

# Languages

- Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but slipped
out. Currently, we are trying to use Scala 2.13.5 via SPARK-34505 and
investigating the publishing issue. Thank you for your contributions and
feedback on this.

- Java 17 LTS Support: Java 17 LTS will arrive in September 2017. Like Java
11, we need lots of support from our dependencies. Let's see.

- Python 3.6 Deprecation(?): Python 3.6 community support ends on
2021-12-23. So, the deprecation is not required yet, but we had better
prepare for it because we don't have an ETA for Apache Spark 3.3 in 2022.

- SparkR CRAN publishing: As we know, it's discontinued so far. Resuming it
depends on the success of Apache SparkR 3.1.1 CRAN publishing. If that
succeeds, we can keep publishing. Otherwise, I believe we had
better drop it from the releasing work item list officially.

# Dependencies

- Apache Hadoop 3.3.2: Hadoop 3.2.0 became the default Hadoop profile in
Apache Spark 3.1. Currently, the Spark master branch lives on Hadoop 3.2.2's
shaded clients via SPARK-33212. So far, there is one ongoing report in the
YARN environment. We hope it will be fixed within the Spark 3.2 timeframe so
we can move toward Hadoop 3.3.2.

- Apache Hive 2.3.9: Spark 3.0 started using Hive 2.3.7 by default instead
of the old Hive 1.2 fork. Spark 3.1 removed the hive-1.2 profile completely
via SPARK-32981 and replaced the generated hive-service-rpc code with the
official dependency. We are steadily improving this area and will consume
Hive 2.3.9 once it is available.

- K8s Client 4.13.2: During K8s GA activity, Spark 3.1 upgrades K8s client
dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in order to support
K8s model 1.19.

- Kafka Client 2.8: To bring in the client fixes, Spark 3.1 is using Kafka
Client 2.6. For Spark 3.2, SPARK-33913 upgraded to Kafka 2.7 with Scala
2.12.13, but it was later reverted due to a Scala 2.12.13 issue. Since
KAFKA-12357 fixed the Scala requirement two days ago, Spark 3.2 will
hopefully go with Kafka Client 2.8.

# Some Features

- Data Source v2: Spark 3.2 will deliver a much richer DSv2 with Apache
Iceberg integration. In particular, we hope the ongoing function catalog SPIP
and the upcoming storage partitioned join SPIP can be delivered as part of
Spark 3.2 and become an additional foundation.

- Columnar Encryption: As of today, the Apache Spark master branch supports
columnar encryption via Apache ORC 1.6, and it's documented via SPARK-34036.
Also, the upcoming Apache Parquet 1.12 has a similar capability. Hopefully,
Apache Spark 3.2 is going to be the first release to have this feature
officially. Any feedback is welcome.
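
For reference, a rough sketch of what the ORC side looks like on the master
branch today (the KMS URI, key names, and columns below are placeholders, and
a configured Hadoop KeyProvider plus an existing SparkSession `spark` are
assumed):

    // Toy data with columns worth encrypting.
    val df = spark.range(10)
      .selectExpr("cast(id as string) as ssn", "concat(id, '@example.com') as email")

    df.write
      .option("hadoop.security.key.provider.path", "kms://http@kms.example.com:9600/kms")  // placeholder
      .option("orc.key.provider", "hadoop")
      .option("orc.encrypt", "pii:ssn,email")  // key name -> encrypted columns
      .option("orc.mask", "nullify:ssn")       // what readers without the key see
      .orc("/tmp/encrypted_orc_demo")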

- Improved ZStandard Support: Spark 3.2 will bring more benefits to
ZStandard users: 1) SPARK-34340 added native ZSTD JNI buffer pool support
for all IO operations, 2) SPARK-33978 makes the ORC datasource support ZSTD
compression, 3) SPARK-34503 sets ZSTD as the default codec for event log
compression, and 4) SPARK-34479 aims to support ZSTD in the Avro data
source. The upcoming Parquet 1.12 supports ZSTD (and a JNI buffer pool),
too. I'm expecting more benefits.
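
As a concrete sketch of how these land for users (assuming a SparkSession
named `spark`; the event log codec becomes the default per SPARK-34503 and is
shown explicitly here only for clarity):

    // spark-defaults.conf (event log compression with ZSTD):
    //   spark.eventLog.enabled=true
    //   spark.eventLog.compress=true
    //   spark.eventLog.compression.codec=zstd

    // ORC datasource writing ZSTD-compressed files (SPARK-33978):
    spark.range(1000).write.option("compression", "zstd").orc("/tmp/zstd_orc_demo")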

- Structured Streaming with RocksDB backend: According to the latest update,
it looks active enough for merging to the master branch in Spark 3.2.
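
If it lands, enabling it would presumably amount to swapping the state store
provider before starting a streaming query (the class name below is an
assumption based on the proposed patches):

    spark.conf.set(
      "spark.sql.streaming.stateStore.providerClass",
      "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")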

Please share your thoughts and let's build better Apache Spark 3.2 together.

Bests,
Dongjoon.