Re: hadoop-2 profile to be removed in 3.5.0

2023-04-15 Thread yangjie01
Thanks Chao ~

Yang Jie

From: Dongjoon Hyun
Date: Sunday, April 16, 2023 00:08
To: Chao Sun
Cc: dev
Subject: Re: hadoop-2 profile to be removed in 3.5.0

Thank you so much for the heads-up, Chao!

Dongjoon.


On Fri, Apr 14, 2023 at 6:33 PM Chao Sun <sunc...@apache.org> wrote:
Hi all,

Just a heads up that the `hadoop-2` profile is going to be removed in
Apache Spark 3.5.0. This has been discussed previously through this
email thread: 
https://lists.apache.org/thread/z4jdy9959b6zj9t726zl0zcrk4hzs0xs
and is now realized via
https://issues.apache.org/jira/browse/SPARK-42452

Feel free to comment if you still have any concerns.

Thanks.
Chao

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


Re: Parametrisable output metadata path

2023-04-15 Thread Jungtaek Lim
Hi,

We have received many reports of issues with the current FileStream sink.
The effort needed to fix them is quite significant, and that effort ended
up producing the "Data Lake" products.

I'd recommend not fixing the issue but accepting it as a limitation, and
integrating your workload with Data Lake products. Full disclosure: I work
at Databricks, so I might be biased. But even at my previous employer,
which had no Data Lake product at that time, I had to agree that there
were too many things to fix and that the effort would be fully redundant
with existing products.

It might be helpful to have an "at-least-once" version of the FileStream
sink, where a metadata directory is no longer needed. It may require the
implementation to go back to the old way of atomic renaming, but it would
also remove the need for a metadata directory, so someone might find it
useful. For end-to-end exactly-once, people can either use the current
FileStream sink with its limitations or use Data Lake products. I don't
see the value in making improvements to the current FileStream sink.
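
To illustrate what "atomic renaming" with at-least-once semantics could
look like, here is a minimal Python sketch. It is not Spark's
implementation: the helper names commit_file and write_batch and the file
naming scheme are hypothetical, and it assumes a local POSIX filesystem
where os.replace is atomic (object stores generally do not offer this).

```python
import os
import tempfile
from typing import List


def commit_file(final_path: str, data: bytes) -> None:
    """Write data to a temp file, then atomically rename it into place."""
    target_dir = os.path.dirname(final_path) or "."
    os.makedirs(target_dir, exist_ok=True)
    # Stage in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        # Readers see either no file or the complete file, never a partial one.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the staged file if the write or rename failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def write_batch(out_dir: str, batch_id: int, attempt: int, rows: List[str]) -> str:
    """Hypothetical sink step: one output file per (batch, attempt)."""
    name = f"part-{batch_id:05d}-{attempt}.txt"
    commit_file(os.path.join(out_dir, name), "\n".join(rows).encode())
    return name
```

Without a metadata log, nothing records which attempt of a batch "won", so
a retried batch leaves a second complete file behind. Readers never see
partial files, but they may see duplicates, which is the at-least-once
trade-off.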

Thanks,
Jungtaek Lim (HeartSaVioR)

On Sun, Apr 16, 2023 at 2:52 AM Wojciech Indyk wrote:

> Hi!
> I raised a ticket for a parametrisable output metadata path:
> https://issues.apache.org/jira/browse/SPARK-43152.
> I am going to raise a PR for it, and I realised that this relatively
> simple change affects the method hasMetadata(path), which would take on a
> new meaning if we can define a custom path for the metadata of output
> files. Can you please share your opinion on how a custom output metadata
> path could affect the design of Structured Streaming?
> E.g. I can see one case: I set the output metadata path parameter, run a
> job on output path A, stop the job, change the output path to B, and
> hasMetadata still works well. If you have any corner case in mind where a
> parametrised output metadata path could break something, please describe
> it.
>
> --
> Kind regards/ Pozdrawiam,
> Wojciech Indyk
>


Parametrisable output metadata path

2023-04-15 Thread Wojciech Indyk
Hi!
I raised a ticket for a parametrisable output metadata path:
https://issues.apache.org/jira/browse/SPARK-43152.
I am going to raise a PR for it, and I realised that this relatively
simple change affects the method hasMetadata(path), which would take on a
new meaning if we can define a custom path for the metadata of output
files. Can you please share your opinion on how a custom output metadata
path could affect the design of Structured Streaming?
E.g. I can see one case: I set the output metadata path parameter, run a
job on output path A, stop the job, change the output path to B, and
hasMetadata still works well. If you have any corner case in mind where a
parametrised output metadata path could break something, please describe
it.
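
To make the question concrete, here is a simplified Python model of the
check. It is only a sketch, not Spark's code: it assumes the sink keeps its
log in a `_spark_metadata` directory under the output path by default, and
the metadata_path argument and the demo function are hypothetical
stand-ins for the parameter proposed in the ticket.

```python
import os
import tempfile
from typing import Optional, Tuple

METADATA_DIR = "_spark_metadata"  # default log directory under the output path


def has_metadata(output_path: str, metadata_path: Optional[str] = None) -> bool:
    """Simplified model: does a sink metadata log exist for this output?

    metadata_path is the hypothetical custom location; when it is None the
    log is expected under the output path itself.
    """
    log_dir = metadata_path or os.path.join(output_path, METADATA_DIR)
    return os.path.isdir(log_dir)


def demo() -> Tuple[bool, bool, bool]:
    with tempfile.TemporaryDirectory() as root:
        out_a = os.path.join(root, "A")
        out_b = os.path.join(root, "B")
        custom = os.path.join(root, "meta")
        os.makedirs(out_a)
        os.makedirs(custom)  # the job wrote its log to the custom location
        writer_on_a = has_metadata(out_a, custom)  # writer knows the parameter
        reader_on_a = has_metadata(out_a)          # reader only knows path A
        writer_on_b = has_metadata(out_b, custom)  # output moved to B, same log
        return writer_on_a, reader_on_a, writer_on_b
```

In this model demo() returns (True, False, True): a reader that is given
only output path A no longer finds the log, and the check for B succeeds
even though the log describes files written to A. Whether either of these
matters in practice is exactly the kind of corner case I am asking about.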

--
Kind regards/ Pozdrawiam,
Wojciech Indyk


Re: hadoop-2 profile to be removed in 3.5.0

2023-04-15 Thread Dongjoon Hyun
Thank you so much for the heads-up, Chao!

Dongjoon.


On Fri, Apr 14, 2023 at 6:33 PM Chao Sun wrote:

> Hi all,
>
> Just a heads up that the `hadoop-2` profile is going to be removed in
> Apache Spark 3.5.0. This has been discussed previously through this
> email thread:
> https://lists.apache.org/thread/z4jdy9959b6zj9t726zl0zcrk4hzs0xs
> and is now realized via
> https://issues.apache.org/jira/browse/SPARK-42452
>
> Feel free to comment if you still have any concerns.
>
> Thanks.
> Chao


Re: [ANNOUNCE] Apache Spark 3.4.0 released

2023-04-15 Thread Dongjoon Hyun
Nice catch, Xiao!

All `latest` tags are updated to v3.4.0 now.

https://hub.docker.com/r/apache/spark/tags
https://hub.docker.com/r/apache/spark-py/tags
https://hub.docker.com/r/apache/spark-r/tags

Dongjoon.


On Fri, Apr 14, 2023 at 8:38 PM Xiao Li wrote:

> @Dongjoon Hyun Thank you!
>
> Could you also help update the latest tag?
> https://hub.docker.com/r/apache/spark/tags
>
> Xiao
>
> On Fri, Apr 14, 2023 at 4:23 PM Dongjoon Hyun wrote:
>
>> Apache Spark Docker images are published too.
>>
>> docker pull apache/spark:v3.4.0
>> docker pull apache/spark-py:v3.4.0
>> docker pull apache/spark-r:v3.4.0
>>
>> Thanks,
>> Dongjoon
>>
>>
>> On Fri, Apr 14, 2023 at 2:56 PM Dongjoon Hyun wrote:
>>
>>> Thank you, Xinrong!
>>>
>>> Dongjoon.
>>>
>>>
>>> On Fri, Apr 14, 2023 at 1:37 PM Xiao Li wrote:
>>>
>>>> Thank you Xinrong!
>>>>
>>>> Congratulations everyone! This is a great release with tons of new
>>>> features!
>>>>
>>>> On Fri, Apr 14, 2023 at 1:04 PM Gengliang Wang wrote:
>>>>
>>>>> Congratulations everyone!
>>>>> Thank you Xinrong for driving the release!
>>>>>
>>>>> On Fri, Apr 14, 2023 at 12:47 PM Xinrong Meng <xinrong.apa...@gmail.com> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> We are happy to announce the availability of *Apache Spark 3.4.0*!
>>>>>>
>>>>>> Apache Spark 3.4.0 is the fifth release of the 3.x line.
>>>>>>
>>>>>> To download Spark 3.4.0, head over to the download page:
>>>>>> https://spark.apache.org/downloads.html
>>>>>>
>>>>>> To view the release notes:
>>>>>> https://spark.apache.org/releases/spark-release-3-4-0.html
>>>>>>
>>>>>> We would like to acknowledge all community members for contributing
>>>>>> to this release. This release would not have been possible without
>>>>>> you.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Xinrong Meng
>>>>>>
>>>>>