Re: Spark 3.2.4 pom NOT FOUND on Maven

2023-04-17 Thread Enrico Minack

Any suggestions on how to fix or use the Spark 3.2.4 (Scala 2.13) release?

Cheers,
Enrico


Am 17.04.23 um 08:19 schrieb Enrico Minack:

Hi,

thanks for the Spark 3.2.4 release.

I have found that Maven does not serve the spark-parent_2.13 pom file. 
It is listed in the directory:

https://repo1.maven.org/maven2/org/apache/spark/spark-parent_2.13/3.2.4/

But it cannot be downloaded:
https://repo1.maven.org/maven2/org/apache/spark/spark-parent_2.13/3.2.4/spark-parent_2.13-3.2.4.pom

The 2.12 file is fine:
https://repo1.maven.org/maven2/org/apache/spark/spark-parent_2.12/3.2.4/spark-parent_2.12-3.2.4.pom
Any chance this can be fixed?

Cheers,
Enrico


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org







Re: Spark Multiple Hive Metastore Catalog Support

2023-04-17 Thread Ankit Gupta
Thanks, Elliot! Let me check it out!

On Mon, 17 Apr, 2023, 10:08 pm Elliot West,  wrote:

> Hi Ankit,
>
> While not a part of Spark, there is a project called 'WaggleDance' that
> can federate multiple Hive metastores so that they are accessible via a
> single URI: https://github.com/ExpediaGroup/waggle-dance
>
> This may be useful or perhaps serve as inspiration.
>
> Thanks,
>
> Elliot.
>
> On Mon, 17 Apr 2023 at 16:38, Ankit Gupta  wrote:
>
>> ++
>> User Mailing List
>>
>> Just a reminder: can anyone help with this?
>>
>> Thanks a lot!
>>
>> Ankit Prakash Gupta
>>
>> On Wed, Apr 12, 2023 at 8:22 AM Ankit Gupta 
>> wrote:
>>
>>> Hi All
>>>
>>> The question is regarding support for multiple Remote Hive Metastore
>>> catalogs in Spark. Starting with Spark 3, multiple-catalog support was
>>> added, but is there any CatalogPlugin implementation that lets us
>>> configure multiple Remote Hive Metastore catalogs? If so, can anyone
>>> share the fully qualified class name I can use to configure a Hive
>>> Metastore catalog? If not, I would like to work on implementing a
>>> CatalogPlugin that can be used to configure multiple Hive Metastore
>>> servers.
>>>
>>> Thanks and Regards.
>>>
>>> Ankit Prakash Gupta
>>> +91 8750101321
>>> info.ank...@gmail.com
>>>
>>>


Re: Parametrisable output metadata path

2023-04-17 Thread Jungtaek Lim
Small correction: that should read "I intentionally didn't enumerate." The
meaning is quite different, hence the correction.

On Tue, Apr 18, 2023 at 5:38 AM Jungtaek Lim 
wrote:

> There seems to be miscommunication - I didn't mean "Delta Lake". I meant
> "any" Data Lake products. Since I'm biased I didn't intentionally enumerate
> actual products, but there are "Apache Hudi", "Apache Iceberg", etc as well.
>
> We have already made a non-trivial number of band-aid fixes for the file
> stream sink. For example,
>
> https://github.com/apache/spark/pull/28363
> https://github.com/apache/spark/pull/28904
> https://github.com/apache/spark/pull/29505
> https://github.com/apache/spark/pull/31638
>
> There was a lot of pushback, because these fixes do not solve the real
> problem. The consensus was that we don't want to come up with another Data
> Lake product, which would require us to put in months (or maybe years) of
> effort. Now, these Data Lake products are backed by companies and are
> successful projects in their own right. I'm not sure I can be supportive
> of the effort on another band-aid fix.
>
> Maintaining the metadata directory is the root of the headache. Unless we
> see the benefit of removing the metadata directory (hence at-least-once)
> and plan to deal with that, I'd like to leave the file stream sink as it
> is.
>
> On Mon, Apr 17, 2023 at 7:37 PM Wojciech Indyk 
> wrote:
>
>> Hi Jungtaek,
>> integration with Delta Lake is not an option for me. I raised a PR that
>> improves FileStreamSink with the new parameter:
>> https://github.com/apache/spark/pull/40821. Can you please take a look?
>>
>> --
>> Kind regards/ Pozdrawiam,
>> Wojciech Indyk
>>
>>
>> niedz., 16 kwi 2023 o 04:45 Jungtaek Lim 
>> napisał(a):
>>
>>> Hi,
>>>
>>> Many issues have been reported with the current FileStream sink. The
>>> effort to fix them is quite significant, and elsewhere it has led to
>>> the creation of "Data Lake" products.
>>>
>>> I'd recommend not fixing the issue but leaving it as a known
>>> limitation, and integrating your workload with Data Lake products. For
>>> full disclosure, I work at Databricks, so I might be biased; but even
>>> when I was at my previous employer, which didn't have a Data Lake
>>> product at the time, I had to agree that there are too many things to
>>> fix, and the effort would be fully redundant with existing products.
>>>
>>> Maybe, it might be helpful to have an "at-least-once" version of
>>> FileStream sink, where a metadata directory is no longer needed. It may
>>> require the implementation to go back to the old way of atomic renaming,
>>> but it will also get rid of the necessity of a metadata directory, so
>>> someone might find it useful. For end-to-end exactly once, people can
>>> either use a limited current FileStream sink or use Data Lake products. I
>>> don't see the value in making improvements to the current FileStream sink.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> On Sun, Apr 16, 2023 at 2:52 AM Wojciech Indyk 
>>> wrote:
>>>
 Hi!
 I raised a ticket on parametrisable output metadata path
 https://issues.apache.org/jira/browse/SPARK-43152.
 I am going to raise a PR for it, and I realised that this relatively
 simple change affects the method hasMetadata(path), which would take on a
 new meaning if a custom path can be defined for the metadata of output
 files. Can you please share your opinion on how a custom output metadata
 path could impact the design of Structured Streaming?
 E.g. I can see one case: I set the output metadata path parameter, run a
 job on output path A, stop the job, change the output path to B, and
 hasMetadata still works correctly. If you have any corner case in mind
 where the parametrised output metadata path could break something, please
 describe it.

 --
 Kind regards/ Pozdrawiam,
 Wojciech Indyk

>>>


Re: Parametrisable output metadata path

2023-04-17 Thread Jungtaek Lim
There seems to be miscommunication - I didn't mean "Delta Lake". I meant
"any" Data Lake products. Since I'm biased I didn't intentionally enumerate
actual products, but there are "Apache Hudi", "Apache Iceberg", etc as well.

We have already made a non-trivial number of band-aid fixes for the file
stream sink. For example,

https://github.com/apache/spark/pull/28363
https://github.com/apache/spark/pull/28904
https://github.com/apache/spark/pull/29505
https://github.com/apache/spark/pull/31638

There was a lot of pushback, because these fixes do not solve the real
problem. The consensus was that we don't want to come up with another Data
Lake product, which would require us to put in months (or maybe years) of
effort. Now, these Data Lake products are backed by companies and are
successful projects in their own right. I'm not sure I can be supportive
of the effort on another band-aid fix.

Maintaining the metadata directory is the root of the headache. Unless we
see the benefit of removing the metadata directory (hence at-least-once)
and plan to deal with that, I'd like to leave the file stream sink as it is.
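
The "old way of atomic renaming" mentioned here can be sketched roughly as
follows. This is only an illustration of the general write-temp-then-rename
pattern that gives at-least-once file output without a metadata directory;
it is not Spark's actual FileStreamSink code, and the helper name is
illustrative:

```python
# Sketch of an at-least-once file commit via atomic rename: write the
# output to a temporary file, then atomically move it into place. No
# metadata directory is involved, so a replayed batch may rewrite a file
# (at-least-once semantics), but readers never observe partial output.
import os
import tempfile

def commit_file(target_path: str, data: bytes) -> None:
    target_dir = os.path.dirname(target_path) or "."
    # Create the temp file in the same directory so the rename stays
    # within one filesystem and remains atomic.
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, target_path)  # atomic rename on POSIX
    except BaseException:
        # Clean up the temp file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Note that this pattern relies on the filesystem supporting atomic rename,
which is exactly what object stores like S3 lack, hence the metadata
directory in the current sink.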

On Mon, Apr 17, 2023 at 7:37 PM Wojciech Indyk 
wrote:

> Hi Jungtaek,
> integration with Delta Lake is not an option for me. I raised a PR that
> improves FileStreamSink with the new parameter:
> https://github.com/apache/spark/pull/40821. Can you please take a look?
>
> --
> Kind regards/ Pozdrawiam,
> Wojciech Indyk
>
>
> niedz., 16 kwi 2023 o 04:45 Jungtaek Lim 
> napisał(a):
>
>> Hi,
>>
>> Many issues have been reported with the current FileStream sink. The
>> effort to fix them is quite significant, and elsewhere it has led to
>> the creation of "Data Lake" products.
>>
>> I'd recommend not fixing the issue but leaving it as a known
>> limitation, and integrating your workload with Data Lake products. For
>> full disclosure, I work at Databricks, so I might be biased; but even
>> when I was at my previous employer, which didn't have a Data Lake
>> product at the time, I had to agree that there are too many things to
>> fix, and the effort would be fully redundant with existing products.
>>
>> Maybe, it might be helpful to have an "at-least-once" version of
>> FileStream sink, where a metadata directory is no longer needed. It may
>> require the implementation to go back to the old way of atomic renaming,
>> but it will also get rid of the necessity of a metadata directory, so
>> someone might find it useful. For end-to-end exactly once, people can
>> either use a limited current FileStream sink or use Data Lake products. I
>> don't see the value in making improvements to the current FileStream sink.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> On Sun, Apr 16, 2023 at 2:52 AM Wojciech Indyk 
>> wrote:
>>
>>> Hi!
>>> I raised a ticket on parametrisable output metadata path
>>> https://issues.apache.org/jira/browse/SPARK-43152.
>>> I am going to raise a PR for it, and I realised that this relatively
>>> simple change affects the method hasMetadata(path), which would take on
>>> a new meaning if a custom path can be defined for the metadata of output
>>> files. Can you please share your opinion on how a custom output metadata
>>> path could impact the design of Structured Streaming?
>>> E.g. I can see one case: I set the output metadata path parameter, run a
>>> job on output path A, stop the job, change the output path to B, and
>>> hasMetadata still works correctly. If you have any corner case in mind
>>> where the parametrised output metadata path could break something,
>>> please describe it.
>>>
>>> --
>>> Kind regards/ Pozdrawiam,
>>> Wojciech Indyk
>>>
>>


Re: [ANNOUNCE] Apache Spark 3.4.0 released

2023-04-17 Thread Xinrong Meng
Thank you, Dongjoon!

On Sat, Apr 15, 2023 at 9:04 AM Dongjoon Hyun 
wrote:

> Nice catch, Xiao!
>
> All `latest` tags are updated to v3.4.0 now.
>
> https://hub.docker.com/r/apache/spark/tags
> https://hub.docker.com/r/apache/spark-py/tags
> https://hub.docker.com/r/apache/spark-r/tags
>
> Dongjoon.
>
>
> On Fri, Apr 14, 2023 at 8:38 PM Xiao Li  wrote:
>
>> @Dongjoon Hyun  Thank you!
>>
>> Could you also help update the latest tag ?
>> https://hub.docker.com/r/apache/spark/tags
>>
>> Xiao
>>
>> Dongjoon Hyun  于2023年4月14日周五 16:23写道:
>>
>>> Apache Spark Docker images are published too.
>>>
>>> docker pull apache/spark:v3.4.0
>>> docker pull apache/spark-py:v3.4.0
>>> docker pull apache/spark-r:v3.4.0
>>>
>>> Thanks,
>>> Dongjoon
>>>
>>>
>>> On Fri, Apr 14, 2023 at 2:56 PM Dongjoon Hyun 
>>> wrote:
>>>
 Thank you, Xinrong!

 Dongjoon.


 On Fri, Apr 14, 2023 at 1:37 PM Xiao Li  wrote:

> Thank you Xinrong!
>
> Congratulations everyone! This is a great release with tons of new
> features!
>
>
>
> Gengliang Wang  于2023年4月14日周五 13:04写道:
>
>> Congratulations everyone!
>> Thank you Xinrong for driving the release!
>>
>> On Fri, Apr 14, 2023 at 12:47 PM Xinrong Meng <
>> xinrong.apa...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> We are happy to announce the availability of *Apache Spark 3.4.0*!
>>>
>>> Apache Spark 3.4.0 is the fifth release of the 3.x line.
>>>
>>> To download Spark 3.4.0, head over to the download page:
>>> https://spark.apache.org/downloads.html
>>>
>>> To view the release notes:
>>> https://spark.apache.org/releases/spark-release-3-4-0.html
>>>
>>> We would like to acknowledge all community members for contributing
>>> to this
>>> release. This release would not have been possible without you.
>>>
>>> Thanks,
>>>
>>> Xinrong Meng
>>>
>>


Re: Spark Multiple Hive Metastore Catalog Support

2023-04-17 Thread Cheng Pan
There is a DSv2-based Hive connector in Apache Kyuubi[1] that supports
connecting multiple HMS in a single Spark application.

Some limitations:

- it currently supports only Spark 3.3
- it has a known issue when used with `spark-sql`, but works fine with
spark-shell and normal jar-based Spark applications.

[1]
https://github.com/apache/kyuubi/tree/master/extensions/spark/kyuubi-spark-connector-hive

Thanks,
Cheng Pan
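
To make the setup concrete, below is a sketch of the Spark conf entries one
would pass (via `--conf` or `SparkSession.builder.config`) to register two
HMS-backed catalogs through the Kyuubi connector. The catalog class name and
the per-catalog `hive.metastore.uris` key follow the Kyuubi connector's
documentation as I understand it; the catalog names and thrift URIs are made
up for illustration, so please verify against [1] before relying on this:

```python
# Sketch: Spark conf entries that register two extra Hive Metastore
# catalogs via the Kyuubi Spark Hive connector. Class name and conf keys
# are assumptions based on the Kyuubi docs -- verify against [1] above.
KYUUBI_HIVE_CATALOG = "org.apache.kyuubi.spark.connector.hive.HiveTableCatalog"

def hms_catalog_conf(name: str, metastore_uri: str) -> dict:
    """Build the Spark conf entries for one named HMS catalog."""
    return {
        f"spark.sql.catalog.{name}": KYUUBI_HIVE_CATALOG,
        f"spark.sql.catalog.{name}.hive.metastore.uris": metastore_uri,
    }

conf = {}
conf.update(hms_catalog_conf("hms_prod", "thrift://metastore-prod:9083"))
conf.update(hms_catalog_conf("hms_dev", "thrift://metastore-dev:9083"))
# With these set, tables would resolve as e.g. `hms_prod.db.table`
# alongside the default `spark_catalog`.
```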


On Apr 18, 2023 at 00:38:23, Elliot West  wrote:

> Hi Ankit,
>
> While not a part of Spark, there is a project called 'WaggleDance' that
> can federate multiple Hive metastores so that they are accessible via a
> single URI: https://github.com/ExpediaGroup/waggle-dance
>
> This may be useful or perhaps serve as inspiration.
>
> Thanks,
>
> Elliot.
>
> On Mon, 17 Apr 2023 at 16:38, Ankit Gupta  wrote:
>
>> ++
>> User Mailing List
>>
>> Just a reminder: can anyone help with this?
>>
>> Thanks a lot!
>>
>> Ankit Prakash Gupta
>>
>> On Wed, Apr 12, 2023 at 8:22 AM Ankit Gupta 
>> wrote:
>>
>>> Hi All
>>>
>>> The question is regarding support for multiple Remote Hive Metastore
>>> catalogs in Spark. Starting with Spark 3, multiple-catalog support was
>>> added, but is there any CatalogPlugin implementation that lets us
>>> configure multiple Remote Hive Metastore catalogs? If so, can anyone
>>> share the fully qualified class name I can use to configure a Hive
>>> Metastore catalog? If not, I would like to work on implementing a
>>> CatalogPlugin that can be used to configure multiple Hive Metastore
>>> servers.
>>>
>>> Thanks and Regards.
>>>
>>> Ankit Prakash Gupta
>>> +91 8750101321
>>> info.ank...@gmail.com
>>>
>>>


Re: Spark Multiple Hive Metastore Catalog Support

2023-04-17 Thread Elliot West
Hi Ankit,

While not a part of Spark, there is a project called 'WaggleDance' that can
federate multiple Hive metastores so that they are accessible via a single
URI: https://github.com/ExpediaGroup/waggle-dance

This may be useful or perhaps serve as inspiration.

Thanks,

Elliot.

On Mon, 17 Apr 2023 at 16:38, Ankit Gupta  wrote:

> ++
> User Mailing List
>
> Just a reminder: can anyone help with this?
>
> Thanks a lot!
>
> Ankit Prakash Gupta
>
> On Wed, Apr 12, 2023 at 8:22 AM Ankit Gupta  wrote:
>
>> Hi All
>>
>> The question is regarding support for multiple Remote Hive Metastore
>> catalogs in Spark. Starting with Spark 3, multiple-catalog support was
>> added, but is there any CatalogPlugin implementation that lets us
>> configure multiple Remote Hive Metastore catalogs? If so, can anyone
>> share the fully qualified class name I can use to configure a Hive
>> Metastore catalog? If not, I would like to work on implementing a
>> CatalogPlugin that can be used to configure multiple Hive Metastore
>> servers.
>>
>> Thanks and Regards.
>>
>> Ankit Prakash Gupta
>> +91 8750101321
>> info.ank...@gmail.com
>>
>>


Re: Spark Multiple Hive Metastore Catalog Support

2023-04-17 Thread Ankit Gupta
++
User Mailing List

Just a reminder: can anyone help with this?

Thanks a lot!

Ankit Prakash Gupta

On Wed, Apr 12, 2023 at 8:22 AM Ankit Gupta  wrote:

> Hi All
>
> The question is regarding support for multiple Remote Hive Metastore
> catalogs in Spark. Starting with Spark 3, multiple-catalog support was
> added, but is there any CatalogPlugin implementation that lets us
> configure multiple Remote Hive Metastore catalogs? If so, can anyone
> share the fully qualified class name I can use to configure a Hive
> Metastore catalog? If not, I would like to work on implementing a
> CatalogPlugin that can be used to configure multiple Hive Metastore
> servers.
>
> Thanks and Regards.
>
> Ankit Prakash Gupta
> +91 8750101321
> info.ank...@gmail.com
>
>


Spark 3.2.4 pom NOT FOUND on Maven

2023-04-17 Thread Enrico Minack

Hi,

thanks for the Spark 3.2.4 release.

I have found that Maven does not serve the spark-parent_2.13 pom file. 
It is listed in the directory:

https://repo1.maven.org/maven2/org/apache/spark/spark-parent_2.13/3.2.4/

But it cannot be downloaded:
https://repo1.maven.org/maven2/org/apache/spark/spark-parent_2.13/3.2.4/spark-parent_2.13-3.2.4.pom

The 2.12 file is fine:
https://repo1.maven.org/maven2/org/apache/spark/spark-parent_2.12/3.2.4/spark-parent_2.12-3.2.4.pom

Any chance this can be fixed?

Cheers,
Enrico
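
For anyone who wants to script the availability check described above, a
minimal sketch follows. The Maven coordinates and URLs are the ones from the
message; the helper names (`pom_url`, `is_available`) are illustrative, not
part of any Spark or Maven tooling:

```python
# Sketch: build the Maven Central URL for an artifact's pom and check
# whether the repository actually serves it with an HTTP HEAD request.
from urllib.request import Request, urlopen

MAVEN_CENTRAL = "https://repo1.maven.org/maven2"

def pom_url(group_id: str, artifact_id: str, version: str) -> str:
    """Construct the canonical repository path for an artifact's pom file."""
    group_path = group_id.replace(".", "/")
    return (f"{MAVEN_CENTRAL}/{group_path}/{artifact_id}/{version}/"
            f"{artifact_id}-{version}.pom")

def is_available(url: str, timeout: float = 10.0) -> bool:
    """Return True if the repository serves the file (HTTP 200)."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

url = pom_url("org.apache.spark", "spark-parent_2.13", "3.2.4")
# At the time of the report above, is_available(url) returned False while
# the corresponding spark-parent_2.12 pom was served fine.
```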





Re: Parametrisable output metadata path

2023-04-17 Thread Wojciech Indyk
Hi Jungtaek,
integration with Delta Lake is not an option for me. I raised a PR that
improves FileStreamSink with the new parameter:
https://github.com/apache/spark/pull/40821. Can you please take a look?

--
Kind regards/ Pozdrawiam,
Wojciech Indyk
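
The hasMetadata(path) concern discussed in this thread can be sketched in a
few lines. This is illustrative Python, not Spark's Scala implementation;
only the `_spark_metadata` directory name matches what the file stream sink
actually uses:

```python
# Sketch of the check behind FileStreamSink.hasMetadata: today Spark
# decides "this directory was written by the file stream sink" by looking
# for a `_spark_metadata` subdirectory under the output path itself.
import os

METADATA_DIR = "_spark_metadata"

def has_metadata(output_path: str) -> bool:
    """Current behaviour: metadata is assumed to live under the output path."""
    return os.path.isdir(os.path.join(output_path, METADATA_DIR))

def has_metadata_custom(output_path: str, metadata_path: str = None) -> bool:
    """With a parametrisable metadata path, the check must consult the
    configured location; a reader given only `output_path` can no longer
    detect sink-written output on its own."""
    base = metadata_path or os.path.join(output_path, METADATA_DIR)
    return os.path.isdir(base)
```

This makes the corner case above concrete: if the metadata lives outside
output path A, a downstream reader of A has no way to discover it without
the same configuration.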


niedz., 16 kwi 2023 o 04:45 Jungtaek Lim 
napisał(a):

> Hi,
>
> Many issues have been reported with the current FileStream sink. The
> effort to fix them is quite significant, and elsewhere it has led to
> the creation of "Data Lake" products.
>
> I'd recommend not fixing the issue but leaving it as a known
> limitation, and integrating your workload with Data Lake products. For
> full disclosure, I work at Databricks, so I might be biased; but even
> when I was at my previous employer, which didn't have a Data Lake
> product at the time, I had to agree that there are too many things to
> fix, and the effort would be fully redundant with existing products.
>
> Maybe, it might be helpful to have an "at-least-once" version of
> FileStream sink, where a metadata directory is no longer needed. It may
> require the implementation to go back to the old way of atomic renaming,
> but it will also get rid of the necessity of a metadata directory, so
> someone might find it useful. For end-to-end exactly once, people can
> either use a limited current FileStream sink or use Data Lake products. I
> don't see the value in making improvements to the current FileStream sink.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Sun, Apr 16, 2023 at 2:52 AM Wojciech Indyk 
> wrote:
>
>> Hi!
>> I raised a ticket on parametrisable output metadata path
>> https://issues.apache.org/jira/browse/SPARK-43152.
>> I am going to raise a PR for it, and I realised that this relatively
>> simple change affects the method hasMetadata(path), which would take on
>> a new meaning if a custom path can be defined for the metadata of output
>> files. Can you please share your opinion on how a custom output metadata
>> path could impact the design of Structured Streaming?
>> E.g. I can see one case: I set the output metadata path parameter, run a
>> job on output path A, stop the job, change the output path to B, and
>> hasMetadata still works correctly. If you have any corner case in mind
>> where the parametrised output metadata path could break something,
>> please describe it.
>>
>> --
>> Kind regards/ Pozdrawiam,
>> Wojciech Indyk
>>
>


The Spark email settings should be updated

2023-04-17 Thread Jia Fan
Hi, everyone.

I've found that every time I reply to the dev mailing list, the default
reply address is the sender of the mail, not dev@spark.apache.org. Several
times this caused me to think my reply to dev was successful when it
wasn't. This does not seem to be a common problem: when I reply to emails
from other communities, the default reply address is d...@xxx.apache.org.
Can Spark modify the corresponding settings to reduce the chance of
developers replying incorrectly?

Thanks





Jia Fan