Re: Parametrisable output metadata path

2023-04-18 Thread Wojciech Indyk
Thank you for your response!
I misread "data lake" as "delta lake", my bad. Either way, I need to write
output to a file system. I see your point about data lakes; however,
migrations take time, so at least from that perspective I wouldn't
deprecate FileStreamSink. I hope FileStreamSink will still be maintained. I
understand that, given the rapid development of data lakes, FileStreamSink
is not a priority at all, which is why I prepared the PR to help with part
of the work. The other part is the review, which I kindly ask for. IMO my PR
is not a "band-aid fix" but rather a low-hanging-fruit improvement that
helps with a few issues. I might be biased, obviously. :)

--
Kind regards,
Wojciech Indyk


Re: Parametrisable output metadata path

2023-04-17 Thread Jungtaek Lim
Small correction: "I intentionally didn't enumerate." The meaning could be
quite different, so I'm making a small correction.


Re: Parametrisable output metadata path

2023-04-17 Thread Jungtaek Lim
There seems to be a miscommunication - I didn't mean "Delta Lake". I meant
"any" Data Lake product. Since I'm biased I didn't intentionally enumerate
actual products, but there are "Apache Hudi", "Apache Iceberg", etc. as well.

We have already made a non-trivial number of band-aid fixes to the file
stream sink. For example,

https://github.com/apache/spark/pull/28363
https://github.com/apache/spark/pull/28904
https://github.com/apache/spark/pull/29505
https://github.com/apache/spark/pull/31638

There was a lot of pushback, because these fixes do not solve the real
problem. The consensus was that we don't want to come up with another Data
Lake product, which would require months (or maybe years) of effort.
These Data Lake products are now backed by companies and are successful
projects in their own right. I'm not sure I can support the effort on
another band-aid fix.

Maintaining the metadata directory is the root of the headache. Unless we
see the benefit of removing the metadata directory (hence at-least-once)
and plan to deal with that, I'd like to leave the file stream sink as it is.
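
For readers less familiar with the sink, the role of the metadata directory
can be sketched roughly as below. This is a simplified Python illustration,
not Spark code: `_spark_metadata` is the directory name the file sink uses,
but the helper names (`committed_batches`, `add_batch`, `read_committed`),
the one-file-per-batch layout, and the JSON log format are assumptions made
for the example.

```python
import json
import tempfile
from pathlib import Path
from typing import List

METADATA_DIR = "_spark_metadata"  # directory name used by the file sink


def committed_batches(output_dir: Path) -> List[int]:
    """Return the batch ids recorded in the (simplified) metadata log."""
    log_dir = output_dir / METADATA_DIR
    if not log_dir.is_dir():
        return []
    return sorted(int(p.name) for p in log_dir.iterdir() if p.name.isdigit())


def add_batch(output_dir: Path, batch_id: int, rows: List[str]) -> bool:
    """Write a batch unless the log already records it.

    Returns True if the batch was written, False if it was skipped.
    """
    if batch_id in committed_batches(output_dir):
        return False  # already committed: a retry must not add data again
    log_dir = output_dir / METADATA_DIR
    log_dir.mkdir(parents=True, exist_ok=True)
    data_file = output_dir / f"part-{batch_id:05d}.txt"
    data_file.write_text("\n".join(rows) + "\n")
    # The log entry is written last; it acts as the commit point.
    (log_dir / str(batch_id)).write_text(json.dumps({"files": [data_file.name]}))
    return True


def read_committed(output_dir: Path) -> List[str]:
    """Read only the files that the metadata log lists as committed."""
    rows: List[str] = []
    for batch_id in committed_batches(output_dir):
        entry = json.loads((output_dir / METADATA_DIR / str(batch_id)).read_text())
        for name in entry["files"]:
            rows.extend((output_dir / name).read_text().splitlines())
    return rows


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "out"
    first_attempt = add_batch(out, 0, ["a", "b"])
    retried_attempt = add_batch(out, 0, ["a", "b"])  # replay after a restart
    committed_rows = read_committed(out)
```

In this sketch the log is what turns a replayed batch into a no-op, and it
is also what every reader has to consult, which is why the directory has to
be maintained for as long as the output is read.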


Re: Parametrisable output metadata path

2023-04-17 Thread Wojciech Indyk
Hi Jungtaek,
Integration with Delta Lake is not an option for me. I raised a PR that
improves FileStreamSink with the new parameter:
https://github.com/apache/spark/pull/40821. Can you please take a look?

--
Kind regards,
Wojciech Indyk


Re: Parametrisable output metadata path

2023-04-15 Thread Jungtaek Lim
Hi,

Many issues have been reported against the current FileStream sink. The
effort needed to fix them is quite significant, and it ended up with the
derivation of "Data Lake" products.

I'd recommend not fixing the issue but leaving it as a limitation, and
integrating your workload with Data Lake products. For full disclosure, I
work at Databricks so I might be biased, but even when I was working at my
previous employer, which didn't have a Data Lake product at that time, I
also had to agree that there are too many things to fix, and the effort
would be fully redundant with existing products.

Maybe it would be helpful to have an "at-least-once" version of the
FileStream sink, where a metadata directory is no longer needed. It may
require the implementation to go back to the old way of atomic renaming,
but it would also get rid of the need for a metadata directory, so someone
might find it useful. For end-to-end exactly-once, people can either use
the current, limited FileStream sink or use Data Lake products. I don't see
the value in making improvements to the current FileStream sink.
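
To illustrate what an "at-least-once" variant based on atomic renaming could
look like, here is a minimal sketch in Python. It is not Spark code and not a
proposed implementation: the function names, the file naming scheme, and the
use of a local file system are all assumptions made for the example.

```python
import os
import tempfile
import uuid
from pathlib import Path
from typing import List


def write_batch_at_least_once(output_dir: Path, batch_id: int, rows: List[str]) -> Path:
    """Write one batch as a single file and publish it with an atomic rename.

    No metadata log is kept, so a retried batch is simply written again.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    name = f"part-{batch_id:05d}-{uuid.uuid4().hex}.txt"  # hypothetical naming
    tmp_path = output_dir / f".{name}.tmp"  # hidden while still in progress
    final_path = output_dir / name
    tmp_path.write_text("\n".join(rows) + "\n")
    # os.replace is atomic when source and target are on the same file system,
    # so readers never observe a half-written file under the final name.
    os.replace(tmp_path, final_path)
    return final_path


def visible_files(output_dir: Path) -> List[Path]:
    """List published files, ignoring hidden in-progress ones."""
    return sorted(p for p in output_dir.iterdir() if not p.name.startswith("."))


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "out"
    write_batch_at_least_once(out, 0, ["a", "b"])
    write_batch_at_least_once(out, 0, ["a", "b"])  # same batch replayed
    files_after_retry = len(visible_files(out))
    rows_after_retry = sum(len(p.read_text().splitlines()) for p in visible_files(out))
```

Here the replayed batch produces a second file with the same rows, which is
the duplication that "at-least-once" allows; in exchange, nothing besides
the data files has to be maintained.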

Thanks,
Jungtaek Lim (HeartSaVioR)


Parametrisable output metadata path

2023-04-15 Thread Wojciech Indyk
Hi!
I raised a ticket for a parametrisable output metadata path:
https://issues.apache.org/jira/browse/SPARK-43152.
I am going to raise a PR against it, and I realised that this relatively
simple change affects the method hasMetadata(path), which would have a new
meaning if we can define a custom path for the metadata of output files.
Can you please share your opinion on how a custom output metadata path
could affect the design of Structured Streaming?
E.g. I can see one case where I set the output metadata path parameter, run
a job on output path A, stop the job, change the output path to B, and
hasMetadata still works well. If you have any corner case in mind where the
parametrised output metadata path could break something, please describe it.
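
To make the scenario concrete, here is a minimal sketch in Python of how the
check changes meaning. It is only an illustration on a local file system, not
the Spark implementation: `has_metadata` and its `metadata_path` argument are
hypothetical stand-ins for hasMetadata(path) and the proposed parameter, and
`_spark_metadata` is the default directory name used by the file sink.

```python
import tempfile
from pathlib import Path
from typing import Optional

METADATA_DIR = "_spark_metadata"  # default metadata directory of the file sink


def has_metadata(output_path: Path, metadata_path: Optional[Path] = None) -> bool:
    """Return True if a sink metadata directory exists.

    metadata_path is the hypothetical custom location; when it is None the
    check falls back to <output_path>/_spark_metadata.
    """
    location = metadata_path if metadata_path is not None else output_path / METADATA_DIR
    return location.is_dir()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    out_a, out_b = root / "A", root / "B"
    custom = root / "sink_metadata"
    # Simulate an earlier default-layout run on A and a custom metadata location.
    for directory in (out_a / METADATA_DIR, out_b, custom):
        directory.mkdir(parents=True)

    default_a = has_metadata(out_a)         # metadata lives under A
    default_b = has_metadata(out_b)         # nothing under B
    custom_a = has_metadata(out_a, custom)  # answer no longer depends on A
    custom_b = has_metadata(out_b, custom)  # same answer after switching to B
```

With the default layout the answer is tied to the output directory, while
with a custom path it is the same for A and B. Whether that is the desired
behaviour after changing the output path is the question I would like
opinions on.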

--
Kind regards,
Wojciech Indyk