Re: Spark 3.2.4 pom NOT FOUND on maven
Any suggestions on how to fix or use the Spark 3.2.4 (Scala 2.13) release? Cheers, Enrico On 17.04.23 at 08:19, Enrico Minack wrote: Hi, thanks for the Spark 3.2.4 release. I have found that Maven does not serve the spark-parent_2.13 pom file. It is listed in the directory: https://repo1.maven.org/maven2/org/apache/spark/spark-parent_2.13/3.2.4/ But it cannot be downloaded: https://repo1.maven.org/maven2/org/apache/spark/spark-parent_2.13/3.2.4/spark-parent_2.13-3.2.4.pom The 2.12 file is fine: https://repo1.maven.org/maven2/org/apache/spark/spark-parent_2.12/3.2.4/spark-parent_2.12-3.2.4.pom Any chance this can be fixed? Cheers, Enrico - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
Re: Spark Multiple Hive Metastore Catalog Support
Thanks Elliot ! Let me check it out ! On Mon, 17 Apr, 2023, 10:08 pm Elliot West, wrote: > Hi Ankit, > > While not a part of Spark, there is a project called 'WaggleDance' that > can federate multiple Hive metastores so that they are accessible via a > single URI: https://github.com/ExpediaGroup/waggle-dance > > This may be useful or perhaps serve as inspiration. > > Thanks, > > Elliot. > > On Mon, 17 Apr 2023 at 16:38, Ankit Gupta wrote: > >> ++ >> User Mailing List >> >> Just a reminder, anyone who can help on this. >> >> Thanks a lot ! >> >> Ankit Prakash Gupta >> >> On Wed, Apr 12, 2023 at 8:22 AM Ankit Gupta >> wrote: >> >>> Hi All >>> >>> The question is regarding the support of multiple Remote Hive Metastore >>> catalogs with Spark. Starting Spark 3, multiple catalog support is added in >>> spark, but have we implemented any CatalogPlugin that can help us configure >>> multiple Remote Hive Metastore Catalogs ? If yes, can anyone help me with >>> the Fully Qualified Class Name that I can try using for configuring a Hive >>> Metastore Catalog. If not, I would like to work on the implementation of >>> the CatalogPlugin that we can use to configure multiple Hive Metastore >>> Servers' . >>> >>> Thanks and Regards. >>> >>> Ankit Prakash Gupta >>> +91 8750101321 >>> info.ank...@gmail.com >>> >>>
Re: Parametrisable output metadata path
small correction: "I intentionally didn't enumerate." The meaning would be quite different, so I'm making a small correction. On Tue, Apr 18, 2023 at 5:38 AM Jungtaek Lim wrote: > There seems to be miscommunication - I didn't mean "Delta Lake". I meant > "any" Data Lake products. Since I'm biased I didn't intentionally enumerate > actual products, but there are "Apache Hudi", "Apache Iceberg", etc as well. > > We made non-trivial numbers of band-aid fixes already for file stream > sink. For example, > > https://github.com/apache/spark/pull/28363 > https://github.com/apache/spark/pull/28904 > https://github.com/apache/spark/pull/29505 > https://github.com/apache/spark/pull/31638 > > There were many push backs, because these fixes do not solve the real > problem. The consensus was that we don't want to come up with another Data > Lake product which requires us to put months (or maybe years) of effort. > Now, these Data Lake products are backed by companies and they are > successful projects as individuals. I'm not sure I can be supportive with > the effort on another band-aid fix. > > Maintaining metadata directory is a root of the headache. Unless we see > the benefit of removing the metadata directory (hence at-least-once) and > plan to deal with that, I'd like to leave file stream sink as it is. > > On Mon, Apr 17, 2023 at 7:37 PM Wojciech Indyk > wrote: > >> Hi Jungtaek, >> integration with Delta Lake is not an option to me, I raised a PR for >> improvement of FileStreamSink with the new parameter: >> https://github.com/apache/spark/pull/40821. Can you please take a look? >> >> -- >> Kind regards/ Pozdrawiam, >> Wojciech Indyk >> >> >> On Sun, 16 Apr 2023 at 04:45, Jungtaek Lim >> wrote: >> >>> Hi, >>> >>> We have been indicated with lots of issues with the current FileStream >>> sink. The effort to fix these issues are quite significant, and it ended up >>> with derivation of "Data Lake" products. 
>>> >>> I'd recommend not to fix the issue but leave it as its limitation, and >>> integrate your workload with Data Lake products. For a full disclaimer, I >>> work in Databricks so I might be biased, but even when I was working at the >>> previous employer which didn't have the Data Lake product at that time, I >>> also had to agree that there are too many things to fix, and the effort >>> would be fully redundant with existing products. >>> >>> Maybe, it might be helpful to have an "at-least-once" version of >>> FileStream sink, where a metadata directory is no longer needed. It may >>> require the implementation to go back to the old way of atomic renaming, >>> but it will also get rid of the necessity of a metadata directory, so >>> someone might find it useful. For end-to-end exactly once, people can >>> either use a limited current FileStream sink or use Data Lake products. I >>> don't see the value in making improvements to the current FileStream sink. >>> >>> Thanks, >>> Jungtaek Lim (HeartSaVioR) >>> >>> On Sun, Apr 16, 2023 at 2:52 AM Wojciech Indyk >>> wrote: >>> Hi! I raised a ticket on parametrisable output metadata path https://issues.apache.org/jira/browse/SPARK-43152. I am going to raise a PR against it and I realised, that this relatively simple change impacts on method hasMetadata(path), that would have a new meaning if we can define custom path for metadata of output files. Can you please share your opinion on how the custom output metadata path can impact on design of structured streaming? E.g. I can see one case when I set a parameter of output metadata path, run a job on output path A, stop the job, change the output path to B and hasMetadata works well. If you have any corner case in mind where the parametrised output metadata path can break something please describe it. -- Kind regards/ Pozdrawiam, Wojciech Indyk >>>
Re: Parametrisable output metadata path
There seems to be miscommunication - I didn't mean "Delta Lake". I meant "any" Data Lake products. Since I'm biased I didn't intentionally enumerate actual products, but there are "Apache Hudi", "Apache Iceberg", etc., as well. We have already made a non-trivial number of band-aid fixes to the file stream sink. For example, https://github.com/apache/spark/pull/28363 https://github.com/apache/spark/pull/28904 https://github.com/apache/spark/pull/29505 https://github.com/apache/spark/pull/31638 There was a lot of pushback, because these fixes do not solve the real problem. The consensus was that we don't want to come up with another Data Lake product, which would require us to put in months (or maybe years) of effort. Now, these Data Lake products are backed by companies and are successful projects in their own right. I'm not sure I can be supportive of the effort on another band-aid fix. Maintaining the metadata directory is the root of the headache. Unless we see the benefit of removing the metadata directory (hence at-least-once) and plan to deal with that, I'd like to leave the file stream sink as it is. On Mon, Apr 17, 2023 at 7:37 PM Wojciech Indyk wrote: > Hi Jungtaek, > integration with Delta Lake is not an option to me, I raised a PR for > improvement of FileStreamSink with the new parameter: > https://github.com/apache/spark/pull/40821. Can you please take a look? > > -- > Kind regards/ Pozdrawiam, > Wojciech Indyk > > > On Sun, 16 Apr 2023 at 04:45, Jungtaek Lim > wrote: > >> Hi, >> >> We have been indicated with lots of issues with the current FileStream >> sink. The effort to fix these issues are quite significant, and it ended up >> with derivation of "Data Lake" products. >> >> I'd recommend not to fix the issue but leave it as its limitation, and >> integrate your workload with Data Lake products. 
For a full disclaimer, I >> work in Databricks so I might be biased, but even when I was working at the >> previous employer which didn't have the Data Lake product at that time, I >> also had to agree that there are too many things to fix, and the effort >> would be fully redundant with existing products. >> >> Maybe, it might be helpful to have an "at-least-once" version of >> FileStream sink, where a metadata directory is no longer needed. It may >> require the implementation to go back to the old way of atomic renaming, >> but it will also get rid of the necessity of a metadata directory, so >> someone might find it useful. For end-to-end exactly once, people can >> either use a limited current FileStream sink or use Data Lake products. I >> don't see the value in making improvements to the current FileStream sink. >> >> Thanks, >> Jungtaek Lim (HeartSaVioR) >> >> On Sun, Apr 16, 2023 at 2:52 AM Wojciech Indyk >> wrote: >> >>> Hi! >>> I raised a ticket on parametrisable output metadata path >>> https://issues.apache.org/jira/browse/SPARK-43152. >>> I am going to raise a PR against it and I realised, that this relatively >>> simple change impacts on method hasMetadata(path), that would have a new >>> meaning if we can define custom path for metadata of output files. Can you >>> please share your opinion on how the custom output metadata path can >>> impact on design of structured streaming? >>> E.g. I can see one case when I set a parameter of output metadata path, >>> run a job on output path A, stop the job, change the output path to B and >>> hasMetadata works well. If you have any corner case in mind where the >>> parametrised output metadata path can break something please describe it. >>> >>> -- >>> Kind regards/ Pozdrawiam, >>> Wojciech Indyk >>> >>
Re: [ANNOUNCE] Apache Spark 3.4.0 released
Thank you, Dongjoon! On Sat, Apr 15, 2023 at 9:04 AM Dongjoon Hyun wrote: > Nice catch, Xiao! > > All `latest` tags are updated to v3.4.0 now. > > https://hub.docker.com/r/apache/spark/tags > https://hub.docker.com/r/apache/spark-py/tags > https://hub.docker.com/r/apache/spark-r/tags > > Dongjoon. > > > On Fri, Apr 14, 2023 at 8:38 PM Xiao Li wrote: > >> @Dongjoon Hyun Thank you! >> >> Could you also help update the latest tag? >> https://hub.docker.com/r/apache/spark/tags >> >> Xiao >> >> On Fri, Apr 14, 2023 at 16:23, Dongjoon Hyun wrote: >> >>> Apache Spark Docker images are published too. >>> >>> docker pull apache/spark:v3.4.0 >>> docker pull apache/spark-py:v3.4.0 >>> docker pull apache/spark-r:v3.4.0 >>> >>> Thanks, >>> Dongjoon >>> >>> >>> On Fri, Apr 14, 2023 at 2:56 PM Dongjoon Hyun >>> wrote: >>> Thank you, Xinrong! Dongjoon. On Fri, Apr 14, 2023 at 1:37 PM Xiao Li wrote: > Thank you Xinrong! > > Congratulations everyone! This is a great release with tons of new > features! > > > > On Fri, Apr 14, 2023 at 13:04, Gengliang Wang wrote: >> Congratulations everyone! >> Thank you Xinrong for driving the release! >> >> On Fri, Apr 14, 2023 at 12:47 PM Xinrong Meng < >> xinrong.apa...@gmail.com> wrote: >> >>> Hi All, >>> >>> We are happy to announce the availability of *Apache Spark 3.4.0*! >>> >>> Apache Spark 3.4.0 is the fifth release of the 3.x line. >>> >>> To download Spark 3.4.0, head over to the download page: >>> https://spark.apache.org/downloads.html >>> >>> To view the release notes: >>> https://spark.apache.org/releases/spark-release-3-4-0.html >>> >>> We would like to acknowledge all community members for contributing >>> to this >>> release. This release would not have been possible without you. >>> >>> Thanks, >>> >>> Xinrong Meng >>> >>
Re: Spark Multiple Hive Metastore Catalog Support
There is a DSv2-based Hive connector in Apache Kyuubi[1] that supports connecting to multiple HMS instances in a single Spark application. Some limitations: it currently only supports Spark 3.3, and it has a known issue when used with `spark-sql`, but it works fine with spark-shell and normal jar-based Spark applications. [1] https://github.com/apache/kyuubi/tree/master/extensions/spark/kyuubi-spark-connector-hive Thanks, Cheng Pan On Apr 18, 2023 at 00:38:23, Elliot West wrote: > Hi Ankit, > > While not a part of Spark, there is a project called 'WaggleDance' that > can federate multiple Hive metastores so that they are accessible via a > single URI: https://github.com/ExpediaGroup/waggle-dance > > This may be useful or perhaps serve as inspiration. > > Thanks, > > Elliot. > > On Mon, 17 Apr 2023 at 16:38, Ankit Gupta wrote: > >> ++ >> User Mailing List >> >> Just a reminder, anyone who can help on this. >> >> Thanks a lot ! >> >> Ankit Prakash Gupta >> >> On Wed, Apr 12, 2023 at 8:22 AM Ankit Gupta >> wrote: >> >>> Hi All >>> >>> The question is regarding the support of multiple Remote Hive Metastore >>> catalogs with Spark. Starting Spark 3, multiple catalog support is added in >>> spark, but have we implemented any CatalogPlugin that can help us configure >>> multiple Remote Hive Metastore Catalogs ? If yes, can anyone help me with >>> the Fully Qualified Class Name that I can try using for configuring a Hive >>> Metastore Catalog. If not, I would like to work on the implementation of >>> the CatalogPlugin that we can use to configure multiple Hive Metastore >>> Servers' . >>> >>> Thanks and Regards. >>> >>> Ankit Prakash Gupta >>> +91 8750101321 >>> info.ank...@gmail.com >>> >>>
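To make the multi-catalog wiring concrete: a DSv2 catalog plugin such as the Kyuubi Hive connector is registered per catalog name through `spark.sql.catalog.*` properties. The sketch below is illustrative only: the catalog names (`prod_hms`, `analytics_hms`) and metastore URIs are invented, and the connector class name should be verified against the Kyuubi release being used.

```python
# Hypothetical Spark configuration registering two Hive Metastore catalogs
# via a DSv2 CatalogPlugin. The class name is assumed from the Kyuubi
# connector's package layout; verify it against the version in use.
confs = {
    "spark.sql.catalog.prod_hms":
        "org.apache.kyuubi.spark.connector.hive.HiveTableCatalog",
    "spark.sql.catalog.prod_hms.hive.metastore.uris":
        "thrift://prod-hms.example.com:9083",
    "spark.sql.catalog.analytics_hms":
        "org.apache.kyuubi.spark.connector.hive.HiveTableCatalog",
    "spark.sql.catalog.analytics_hms.hive.metastore.uris":
        "thrift://analytics-hms.example.com:9083",
}

# Once such a session is built, tables are addressed by three-part names:
#   SELECT * FROM prod_hms.sales.orders o
#   JOIN analytics_hms.reporting.daily_agg d ON o.order_id = d.order_id
```

Each catalog name gets its own metastore URI, which is what makes two independent HMS endpoints usable from one session.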
Re: Spark Multiple Hive Metastore Catalog Support
Hi Ankit, While not a part of Spark, there is a project called 'WaggleDance' that can federate multiple Hive metastores so that they are accessible via a single URI: https://github.com/ExpediaGroup/waggle-dance This may be useful or perhaps serve as inspiration. Thanks, Elliot. On Mon, 17 Apr 2023 at 16:38, Ankit Gupta wrote: > ++ > User Mailing List > > Just a reminder, anyone who can help on this. > > Thanks a lot ! > > Ankit Prakash Gupta > > On Wed, Apr 12, 2023 at 8:22 AM Ankit Gupta wrote: > >> Hi All >> >> The question is regarding the support of multiple Remote Hive Metastore >> catalogs with Spark. Starting Spark 3, multiple catalog support is added in >> spark, but have we implemented any CatalogPlugin that can help us configure >> multiple Remote Hive Metastore Catalogs ? If yes, can anyone help me with >> the Fully Qualified Class Name that I can try using for configuring a Hive >> Metastore Catalog. If not, I would like to work on the implementation of >> the CatalogPlugin that we can use to configure multiple Hive Metastore >> Servers' . >> >> Thanks and Regards. >> >> Ankit Prakash Gupta >> +91 8750101321 >> info.ank...@gmail.com >> >>
Re: Spark Multiple Hive Metastore Catalog Support
++ User Mailing List Just a reminder - can anyone help with this? Thanks a lot! Ankit Prakash Gupta On Wed, Apr 12, 2023 at 8:22 AM Ankit Gupta wrote: > Hi All > > The question is regarding the support of multiple Remote Hive Metastore > catalogs with Spark. Starting with Spark 3, multiple catalog support was added to > Spark, but do we have any CatalogPlugin implementation that can help us configure > multiple Remote Hive Metastore Catalogs? If yes, can anyone help me with > the Fully Qualified Class Name that I can try using for configuring a Hive > Metastore Catalog? If not, I would like to work on an implementation of > a CatalogPlugin that we can use to configure multiple Hive Metastore > Servers. > > Thanks and Regards. > > Ankit Prakash Gupta > +91 8750101321 > info.ank...@gmail.com > >
Spark 3.2.4 pom NOT FOUND on maven
Hi, thanks for the Spark 3.2.4 release. I have found that Maven does not serve the spark-parent_2.13 pom file. It is listed in the directory: https://repo1.maven.org/maven2/org/apache/spark/spark-parent_2.13/3.2.4/ But cannot be downloaded: https://repo1.maven.org/maven2/org/apache/spark/spark-parent_2.13/3.2.4/spark-parent_2.13-3.2.4.pom The 2.12 file is fine: https://repo1.maven.org/maven2/org/apache/spark/spark-parent_2.12/3.2.4/spark-parent_2.12-3.2.4.pom Any chance this can be fixed? Cheers, Enrico - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
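Maven Central's directory layout is deterministic, so the pom URLs in question can be derived directly from the artifact coordinates. As an illustrative sketch (this helper is not part of any Spark or Maven tooling), one can reconstruct both URLs and then probe them:

```python
def maven_pom_url(group_id: str, artifact_id: str, version: str) -> str:
    """Build the Maven Central URL of an artifact's .pom file.

    Maven repositories use the layout:
    <repo>/<groupId with dots as slashes>/<artifactId>/<version>/<artifactId>-<version>.pom
    """
    repo = "https://repo1.maven.org/maven2"
    return "/".join([repo, group_id.replace(".", "/"), artifact_id, version,
                     f"{artifact_id}-{version}.pom"])

# The two poms from the report: 2.12 was served fine, 2.13 was not.
pom_2_12 = maven_pom_url("org.apache.spark", "spark-parent_2.12", "3.2.4")
pom_2_13 = maven_pom_url("org.apache.spark", "spark-parent_2.13", "3.2.4")
```

Probing each URL with `curl -I <url>` then distinguishes a served file (HTTP 200) from a missing one (HTTP 404), which makes it easy to confirm when the repository has been repaired.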
Re: Parametrisable output metadata path
Hi Jungtaek, integration with Delta Lake is not an option for me; I raised a PR that improves FileStreamSink with a new parameter: https://github.com/apache/spark/pull/40821. Can you please take a look? -- Kind regards/ Pozdrawiam, Wojciech Indyk On Sun, 16 Apr 2023 at 04:45, Jungtaek Lim wrote: > Hi, > > We have been indicated with lots of issues with the current FileStream > sink. The effort to fix these issues are quite significant, and it ended up > with derivation of "Data Lake" products. > > I'd recommend not to fix the issue but leave it as its limitation, and > integrate your workload with Data Lake products. For a full disclaimer, I > work in Databricks so I might be biased, but even when I was working at the > previous employer which didn't have the Data Lake product at that time, I > also had to agree that there are too many things to fix, and the effort > would be fully redundant with existing products. > > Maybe, it might be helpful to have an "at-least-once" version of > FileStream sink, where a metadata directory is no longer needed. It may > require the implementation to go back to the old way of atomic renaming, > but it will also get rid of the necessity of a metadata directory, so > someone might find it useful. For end-to-end exactly once, people can > either use a limited current FileStream sink or use Data Lake products. I > don't see the value in making improvements to the current FileStream sink. > > Thanks, > Jungtaek Lim (HeartSaVioR) > > On Sun, Apr 16, 2023 at 2:52 AM Wojciech Indyk > wrote: > >> Hi! >> I raised a ticket on parametrisable output metadata path >> https://issues.apache.org/jira/browse/SPARK-43152. >> I am going to raise a PR against it and I realised, that this relatively >> simple change impacts on method hasMetadata(path), that would have a new >> meaning if we can define custom path for metadata of output files. 
Can you >> please share your opinion on how the custom output metadata path can >> impact on design of structured streaming? >> E.g. I can see one case when I set a parameter of output metadata path, >> run a job on output path A, stop the job, change the output path to B and >> hasMetadata works well. If you have any corner case in mind where the >> parametrised output metadata path can break something please describe it. >> >> -- >> Kind regards/ Pozdrawiam, >> Wojciech Indyk >> >
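For readers less familiar with the sink internals: FileStreamSink keeps its commit log in a fixed `_spark_metadata` subdirectory of the output path, and `hasMetadata(path)` effectively checks for that subdirectory. A minimal sketch of the resolution logic under the proposed change follows; the `custom_metadata_path` parameter name is a hypothetical stand-in for whatever option name the PR settles on.

```python
from typing import Optional
import posixpath

# Fixed subdirectory name used by Spark's FileStreamSink for its commit log.
SPARK_METADATA_DIR = "_spark_metadata"

def metadata_path(output_path: str,
                  custom_metadata_path: Optional[str] = None) -> str:
    """Resolve where the file-sink metadata log lives.

    Today Spark always uses <output>/_spark_metadata. With a configurable
    location, a reader can no longer infer from the output directory alone
    whether it was produced by a file stream sink - which is exactly the
    hasMetadata(path) concern raised in this thread.
    """
    if custom_metadata_path is not None:
        return custom_metadata_path
    return posixpath.join(output_path, SPARK_METADATA_DIR)
```

The A-then-B scenario above corresponds to running with a custom location: the metadata log stays put while the output path changes, so the directory check keeps working.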
The Spark email settings should be updated
Hi, everyone. I have noticed that every time I reply to the dev mailing list, the default reply address is the original sender, not dev@spark.apache.org. Several times this led me to believe my reply had reached the dev list when in fact it had not. This does not seem to be a common problem: when I reply to emails from other Apache communities, the default reply address is d...@xxx.apache.org. Could Spark adjust the corresponding mailing list settings to reduce the chance of developers replying to the wrong address? Thanks, Jia Fan