Re: Apache Spark 3.5.0 Expectations (?)

2023-05-31 Thread Dongjoon Hyun
Thank you all for your replies.

1. Thank you, Jia, for those JIRAs.

2. Sounds great for "Scala 2.13 for Spark 4.0". I'll initiate a new thread
for that.
  - "I wonder if it’s safer to do it in Spark 4 (which I believe will be
discussed soon)."
  - "I would make it the default at 4.0, myself."
  - "Shall we initiate a new discussion thread for Scala 2.13 by default?"

3. Thanks. Did you try the pre-built ones, Mich?
  - "I spent a day compiling Spark 3.4.0 code against Scala 2.13.8 with
maven"

  - https://downloads.apache.org/spark/spark-3.2.4/spark-3.2.4-bin-hadoop3.2-scala2.13.tgz
  - https://downloads.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3-scala2.13.tgz
  - https://downloads.apache.org/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3-scala2.13.tgz
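  - Also, for from-source 2.13 builds, note that the POMs must be switched
first: if I remember the Building Spark docs correctly, that means running
./dev/change-scala-version.sh 2.13 and then building with the -Pscala-2.13
Maven profile. Skipping the script leaves the _2.12 artifact IDs in place,
which can produce exactly that kind of runtime weirdness.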

4. Good suggestion, Bjørn. Instead of replacing 3.9, we had better add daily
jobs for Python 3.11, like the Java 17 ones, because Apache Spark 3.4
already added Python 3.11 support via SPARK-41454.
- "First, we are currently conducting tests with Python versions 3.8 and
3.9."
- "Should we consider replacing 3.9 with 3.11?"

5. For Guava, I'm also tracking the ongoing discussion.

Thanks,
Dongjoon.


Re: Apache Spark 3.5.0 Expectations (?)

2023-05-31 Thread Bjørn Jørgensen
@Cheng Pan

https://issues.apache.org/jira/browse/HIVE-22126


Re: Apache Spark 3.5.0 Expectations (?)

2023-05-30 Thread Cheng Pan
@Bjørn Jørgensen

I did some investigation on upgrading Guava after Spark dropped Hadoop 2
support, but unfortunately Hive still depends on it. Worse, Guava's classes
are marked as shared in IsolatedClientLoader[1], which means Spark cannot
upgrade Guava, even after upgrading the built-in Hive from the current 2.3.9
to a newer version that does not pin an old Guava, without breaking old
versions of the Hive Metastore client.

I can't find any clues as to why Guava classes need to be marked as shared.
Can anyone provide some background?

[1] 
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala#L215
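
For context, a minimal sketch of the mechanism, paraphrased from memory
rather than copied from the Spark source (the real prefix list is longer):
any class whose name matches a "shared" prefix is loaded from Spark's base
classloader instead of the isolated Hive-client loader, so the Hive client
always links against Spark's own Guava.

    // Simplified, self-contained Scala sketch; prefixes abridged and illustrative.
    object SharedClassCheck {
      val sharedPrefixes: Seq[String] = Seq(
        "slf4j",
        "org.apache.log4j",
        "org.apache.spark.",
        "scala.",
        "com.google", // Guava crosses the loader boundary
        "java.")

      def isSharedClass(name: String): Boolean =
        sharedPrefixes.exists(prefix => name.startsWith(prefix))

      def main(args: Array[String]): Unit = {
        // Guava resolves as "shared", i.e. from Spark's loader, so the
        // isolated Hive client sees whatever Guava version Spark ships.
        println(isSharedClass("com.google.common.cache.CacheBuilder")) // true
      }
    }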

Thanks,
Cheng Pan


> 

Re: Apache Spark 3.5.0 Expectations (?)

2023-05-30 Thread Bjørn Jørgensen
@Dongjoon Hyun Thank you.

I have two points to discuss.
First, we are currently conducting tests with Python versions 3.8 and 3.9.
Should we consider replacing 3.9 with 3.11?

Secondly, I'd like to know the status of Google Guava.
With Hadoop 2 no longer being supported, is there any other factor blocking
this?

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: Apache Spark 3.5.0 Expectations (?)

2023-05-30 Thread Mich Talebzadeh
I don't know whether it is related, but Scala 2.12.17 is fine for the Spark
3 family (compile and run). I spent a day compiling Spark 3.4.0 code against
Scala 2.13.8 with Maven and was getting all sorts of weird and wonderful
errors at runtime.

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


view my Linkedin profile
https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.







Re: Apache Spark 3.5.0 Expectations (?)

2023-05-29 Thread Jungtaek Lim
Shall we initiate a new discussion thread for Scala 2.13 by default? While
I'm not an expert in this area, it sounds like the change is major and
(probably) breaking. It seems worth having a separate discussion thread
rather than treating it as just one of 25 items.



Re: Apache Spark 3.5.0 Expectations (?)

2023-05-29 Thread Sean Owen
It does seem risky; there are still likely libs out there that don't
cross-compile for 2.13. I would make it the default at 4.0, myself.
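
For libraries, "cross-compile" concretely means publishing one artifact per
Scala binary version. A minimal, purely illustrative sbt stanza (hypothetical
project; versions are examples only):

    // build.sbt for a hypothetical third-party library
    ThisBuild / scalaVersion       := "2.12.18"
    ThisBuild / crossScalaVersions := Seq("2.12.18", "2.13.11")

`sbt +publishLocal` then produces mylib_2.12 and mylib_2.13 artifacts; a
2.13-default Spark can only pull in libs that publish the _2.13 one.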



Re: Apache Spark 3.5.0 Expectations (?)

2023-05-29 Thread Hyukjin Kwon
While I support going forward with a higher version, actually using Scala
2.13 by default is a big deal, especially in that:

   - Users would likely download the built-in version assuming that it’s
   backward binary compatible (see the sketch below).
   - PyPI doesn't allow specifying the Scala version, meaning that users
   wouldn’t have a way to 'pip install pyspark' based on Scala 2.12.

I wonder if it’s safer to do it in Spark 4 (which I believe will be
discussed soon).
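
To make the first point concrete, a minimal illustration of why 2.12 and
2.13 artifacts are not binary compatible (my own example, not Spark code):
the default Seq alias changed between the two versions, so the same source
compiles to different JVM method signatures.

    // Hypothetical library code; the source is identical for both versions.
    // Under 2.12, scala.Seq aliases scala.collection.Seq; under 2.13 it
    // aliases scala.collection.immutable.Seq, so the compiled signature of
    // `total` differs. A caller linked against the _2.12 artifact can hit
    // NoSuchMethodError when run against the _2.13 build.
    object MyLib {
      def total(xs: Seq[Int]): Int = xs.sum
    }

    object Caller {
      def main(args: Array[String]): Unit =
        println(MyLib.total(Seq(1, 2, 3))) // source-compatible, not binary-compatible
    }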




Re: Apache Spark 3.5.0 Expectations (?)

2023-05-28 Thread Jia Fan
Thanks Dongjoon!
There are some tickets I want to share.
SPARK-39420 Support ANALYZE TABLE on v2 tables
SPARK-42750 Support INSERT INTO by name
SPARK-43521 Support CREATE TABLE LIKE FILE



Apache Spark 3.5.0 Expectations (?)

2023-05-28 Thread Dongjoon Hyun
Hi, All.

Apache Spark 3.5.0 is scheduled for August (1st Release Candidate), and a
few notable things are currently under discussion on the mailing list.

I believe it's a good time to share a short summary list (containing both
completed and in-progress items) to highlight them in advance and to collect
your targets too.

Please share your expectations or working items if you want the community to
prioritize them within the Apache Spark 3.5.0 timeframe.

(Sorted by ID)
SPARK-40497 Upgrade Scala 2.13.11
SPARK-42452 Remove hadoop-2 profile from Apache Spark 3.5.0
SPARK-42913 Upgrade to Hadoop 3.3.5 (aws-java-sdk-bundle: 1.12.262 ->
1.12.316)
SPARK-43024 Upgrade Pandas to 2.0.0
SPARK-43200 Remove Hadoop 2 reference in docs
SPARK-43347 Remove Python 3.7 Support
SPARK-43348 Support Python 3.8 in PyPy3
SPARK-43351 Add Spark Connect Go prototype code and example
SPARK-43379 Deprecate old Java 8 versions prior to 8u371
SPARK-43394 Upgrade to Maven 3.8.8
SPARK-43436 Upgrade to RocksDbjni 8.1.1.1
SPARK-43446 Upgrade to Apache Arrow 12.0.0
SPARK-43447 Support R 4.3.0
SPARK-43489 Remove protobuf 2.5.0
SPARK-43519 Bump Parquet to 1.13.1
SPARK-43581 Upgrade kubernetes-client to 6.6.2
SPARK-43588 Upgrade to ASM 9.5
SPARK-43600 Update K8s doc to recommend K8s 1.24+
SPARK-43738 Upgrade to DropWizard Metrics 4.2.18
SPARK-43831 Build and Run Spark on Java 21
SPARK-43832 Upgrade to Scala 2.12.18
SPARK-43836 Make Scala 2.13 as default in Spark 3.5
SPARK-43842 Upgrade gcs-connector to 2.2.14
SPARK-43844 Update to ORC 1.9.0
UMBRELLA: Add SQL functions into Scala, Python and R API

Thanks,
Dongjoon.

PS. The above is not a list of release blockers. Instead, each item may just
be a nice-to-have from someone's perspective.