Re: Apache Spark 3.5.0 Expectations (?)

2023-05-28 Thread Jia Fan
Thanks Dongjoon!
There are some tickets I want to share; a rough sketch of the SQL they target follows below.
SPARK-39420 Support ANALYZE TABLE on v2 tables
SPARK-42750 Support INSERT INTO by name
SPARK-43521 Support CREATE TABLE LIKE FILE
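For context, a minimal PySpark sketch of the first two (catalog, table, and
column names are made up for illustration; the CREATE TABLE LIKE FILE syntax
is still being settled on its ticket, so it is omitted here):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # SPARK-39420: compute statistics on a DataSource v2 table
    spark.sql("ANALYZE TABLE my_catalog.db.events COMPUTE STATISTICS")

    # SPARK-42750: match INSERT columns by name rather than by position
    spark.sql("INSERT INTO db.target BY NAME SELECT b, a FROM db.source")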

Dongjoon Hyun wrote on Mon, May 29, 2023 at 08:42:

> Hi, All.
>
> Apache Spark 3.5.0 is scheduled for August (1st Release Candidate) and
> currently a few notable things are under discussion on the mailing list.
>
> I believe it's a good time to share a short summary list (containing both
> completed and in-progress items) to highlight them in advance and to
> collect your targets as well.
>
> Please share your expectations or working items if you want the community
> to prioritize them within the Apache Spark 3.5.0 timeframe.
>
> [...]
>


Apache Spark 3.5.0 Expectations (?)

2023-05-28 Thread Dongjoon Hyun
Hi, All.

Apache Spark 3.5.0 is scheduled for August (1st Release Candidate) and
currently a few notable things are under discussion on the mailing list.

I believe it's a good time to share a short summary list (containing both
completed and in-progress items) to highlight them in advance and to
collect your targets as well.

Please share your expectations or working items if you want the community
to prioritize them within the Apache Spark 3.5.0 timeframe.

(Sorted by ID)
SPARK-40497 Upgrade Scala 2.13.11
SPARK-42452 Remove hadoop-2 profile from Apache Spark 3.5.0
SPARK-42913 Upgrade to Hadoop 3.3.5 (aws-java-sdk-bundle: 1.12.262 ->
1.12.316)
SPARK-43024 Upgrade Pandas to 2.0.0
SPARK-43200 Remove Hadoop 2 reference in docs
SPARK-43347 Remove Python 3.7 Support
SPARK-43348 Support Python 3.8 in PyPy3
SPARK-43351 Add Spark Connect Go prototype code and example
SPARK-43379 Deprecate old Java 8 versions prior to 8u371
SPARK-43394 Upgrade to Maven 3.8.8
SPARK-43436 Upgrade to rocksdbjni 8.1.1.1
SPARK-43446 Upgrade to Apache Arrow 12.0.0
SPARK-43447 Support R 4.3.0
SPARK-43489 Remove protobuf 2.5.0
SPARK-43519 Bump Parquet to 1.13.1
SPARK-43581 Upgrade kubernetes-client to 6.6.2
SPARK-43588 Upgrade to ASM 9.5
SPARK-43600 Update K8s doc to recommend K8s 1.24+
SPARK-43738 Upgrade to DropWizard Metrics 4.2.18
SPARK-43831 Build and Run Spark on Java 21
SPARK-43832 Upgrade to Scala 2.12.18
SPARK-43836 Make Scala 2.13 the default in Spark 3.5
SPARK-43842 Upgrade gcs-connector to 2.2.14
SPARK-43844 Update to ORC 1.9.0
UMBRELLA: Add SQL functions into Scala, Python and R API

Thanks,
Dongjoon.

PS. The above is not a list of release blockers. Instead, each item could be
a nice-to-have from someone's perspective.


Re: [DISCUSS] Add SQL functions into Scala, Python and R API

2023-05-28 Thread Hyukjin Kwon
Yes, some were cases like you mentioned.
But I found myself explaining that reason to a lot of people, not only
developers but also users - I have been asked at conferences, by email, on
Slack, internally and externally.
Then I realised that maybe we're doing something wrong. This is based on my
experience, so I wanted to open a discussion and see what others think about
this :-).




On Sat, 27 May 2023 at 00:19, Maciej  wrote:

> Weren't some of these functions provided only for compatibility and
> intentionally left out of the language APIs?
>
> --
> Best regards,
> Maciej
>
> On 5/25/23 23:21, Hyukjin Kwon wrote:
>
> I don't think it'd be a release blocker; I think we can implement them
> across multiple releases.
>
> On Fri, May 26, 2023 at 1:01 AM Dongjoon Hyun 
> wrote:
>
>> Thank you for the proposal.
>>
>> I'm wondering if we are going to consider them as release blockers or not.
>>
>> In general, I don't think making those SQL functions available in all
>> languages should be a release blocker
>> (especially in R, or in new Spark Connect languages like Go and Rust).
>>
>> If they are not release blockers, we may accept existing or future
>> community PRs only before the feature freeze (= branch cut).
>>
>> Thanks,
>> Dongjoon.
>>
>>
>> On Wed, May 24, 2023 at 7:09 PM Jia Fan  wrote:
>>
>>> +1
>>> It is important that different APIs can be used to call the same function.
>>>
>>> Ryan Berti wrote on Thu, May 25, 2023 at 01:48:
>>>
 During my recent experience developing functions, I found that identifying
 the locations to touch (SQL, the Connect functions.scala and functions.py,
 the FunctionRegistry, plus whatever is required for R) and the standards
 for adding function signatures were not straightforward (should you use
 optional args or overload functions? which col/lit helpers should be used
 when?). Are there docs describing all of the locations and standards for
 defining a function? If not, that'd be great to have too.
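 For illustration, a hypothetical Python sketch of the two styles in
 question (the names, signatures, and helper below are invented for this
 example, not the project's actual conventions):

     from pyspark.sql.column import Column
     from pyspark.sql.functions import lit

     # Hypothetical col/lit helper: pass Columns through, wrap literals.
     def _to_col(value):
         return value if isinstance(value, Column) else lit(value)

     # Optional-argument style: one definition in functions.py. The Scala
     # API would instead expose overloads in functions.scala, e.g. one
     # (Column, Column) variant and one (Column, Column, Column) variant.
     def percentile(col, percentage, frequency=1):
         ...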

 Ryan Berti

 Senior Data Engineer  |  Ads DE



 On Wed, May 24, 2023 at 12:44 AM Enrico Minack 
 wrote:

> +1
>
> Functions available in SQL (more generally, in one API) should be
> available in all APIs. I am very much in favor of this.
>
> Enrico
>
>
> On May 24, 2023 at 09:41, Hyukjin Kwon wrote:
>
> Hi all,
>
> I would like to discuss adding all SQL functions into the Scala, Python
> and R APIs.
> We have around 175 SQL functions that do not exist in Scala, Python
> and R.
> For example, we don’t have pyspark.sql.functions.percentile but you
> can invoke
> it as a SQL function, e.g., SELECT percentile(...).
>
> The reason why we do not have all functions in the first place is that
> we wanted to add only commonly used functions; see also
> https://github.com/apache/spark/pull/21318 (which I agreed with at the
> time).
>
> However, this has been raised multiple times over the years, from the OSS
> community, the dev mailing list, JIRAs, Stack Overflow, etc.
> It seems confusing which functions are available and which are not.
>
> Yes, we have a workaround: we can call any expression via expr("...")
> or call_udf("...", Columns ...).
> But it still seems not very user-friendly, because users expect these
> functions to be available under the functions namespace.
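>
> A minimal PySpark sketch of the gap (the column name is made up, and the
> proposed percentile API in the comment is illustrative, not a final
> signature):
>
>     from pyspark.sql import SparkSession
>     from pyspark.sql.functions import expr
>
>     spark = SparkSession.builder.getOrCreate()
>     df = spark.range(10).withColumnRenamed("id", "v")
>
>     # Works today: route through the SQL expression parser.
>     df.select(expr("percentile(v, 0.5)")).show()
>
>     # Proposed: a discoverable first-class function, e.g.
>     # from pyspark.sql.functions import percentile
>     # df.select(percentile("v", 0.5)).show()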
>
> Therefore, I would like to propose adding all these expressions to all
> languages, so that Spark is simpler and less confusing, e.g., no more
> guessing which API is in functions and which is not.
>
> Any thoughts?
>
>
>
>

