unsubscribe

2023-10-18 Thread ankur



Re: [DISCUSS] SPIP: State Data Source - Reader

2023-10-18 Thread Jungtaek Lim
I'd also like to repeat here the comment Liang-Chi left in the SPIP doc, as it
raises a general question that comes up for every newly proposed data source,
and I want to answer it for everyone.

> As I know, the author implemented a third-party tool for query state store
> as a data source long time ago. I've suggested some users to use the tool
> before. It is a useful tool for special cases because there is no other
> tool/feature for the purpose.
> I think for such effort to add new data source, one usual question is why
> it has to be in Spark repo instead of as a third-party tool. Especially
> this is not a frequent used one. Even for structured stream users, only
> rare cases it is necessary to look into state store content.


I don't think we should expect the data source to be rarely used. We see two
major use cases: 1) unit tests against stateful queries, and 2) looking into
the state during an incident to get the full context. Users probably won't
run into 2) very often, so it is valid to say that part of the feature may
not be used frequently. But 1) is definitely tied to daily work.

Also, even for 2), this looks like an essential feature that has to be
provided out of the box. Say this feature does not exist and a user
encounters an incident in production with a stateful query. During RCA,
they realize that the state is a black box and that their only option is to
deduce the value of the state indirectly, most likely by modifying the query
heavily and feeding it artificial inputs. If I were such a user, I would
consider this gap a fundamental issue of SS. Flink has offered this out of
the box for years (State Processor), so it also matters from a competitive
standpoint.

We see this effort as a stepping stone. As the comments in the SPIP doc and
the previous replies show, people also see the proposal as groundwork for
the writer part, which would give us a chance to break the long-standing
assumption of a fixed number of shuffle partitions. I'd argue that this is a
rather fundamental limitation of SS, and I have seen many complaints about
it. I don't think it is right to delegate a fundamental issue to a third
party. This is probably a stronger argument than the reader part itself.

Here's another aspect: during the work, we found gaps in checkpointing,
e.g. the information about prefix scan does not exist in the checkpoint, which
makes a big difference when restoring the state from the state file. When
we get to state repartitioning, the repartitioning is based on the grouping
keys in the operator (not the state key), so we will need additional
information for that as well. If this feature lives in a third-party
project, it will be very painful to make changes on both sides together. It
also brings another headache: a versioning and compatibility matrix.
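
To make the repartitioning point concrete, here is a minimal sketch in plain
Python (not Spark code). It assumes each state row carries both a grouping
key and a separate state key, and it uses zlib.crc32 as a stand-in hash;
Spark's actual hash partitioning is different. The names partition_for and
repartition_state, and the sample rows, are hypothetical and only serve the
illustration.

```python
import zlib
from collections import defaultdict


def partition_for(grouping_key, num_partitions):
    # Stand-in hash for illustration only; Spark's own hash partitioning differs.
    return zlib.crc32(grouping_key.encode("utf-8")) % num_partitions


def repartition_state(old_partitions, new_num_partitions):
    """Redistribute state rows into new_num_partitions buckets.

    old_partitions maps partition id -> list of rows, where each row is a
    (grouping_key, state_key, value) tuple. Rows are routed by grouping_key
    only; state_key is carried along untouched.
    """
    new_partitions = defaultdict(list)
    for rows in old_partitions.values():
        for grouping_key, state_key, value in rows:
            target = partition_for(grouping_key, new_num_partitions)
            new_partitions[target].append((grouping_key, state_key, value))
    return dict(new_partitions)


# Hypothetical state rows: the state key carries an extra component (a window
# start) beyond the grouping key.
old = {
    0: [("user-a", ("user-a", 0), 3), ("user-a", ("user-a", 10), 5)],
    1: [("user-b", ("user-b", 0), 1)],
}
new = repartition_state(old, 4)
```

Because routing depends only on the grouping key, rows whose state keys
differ but whose grouping key is the same still land in the same new
partition. That is why the grouping key has to be recoverable from the
checkpoint in the first place.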

I hope this helps persuade people to add this to the Spark repo rather than
leaving it to live on its own.


On Thu, Oct 19, 2023 at 11:08 AM Jungtaek Lim 
wrote:

> Thanks Raghu for your support!
>
> I'd also like to relay the support from the JIRA ticket itself: I see
> support from Chaoqin and Praveen. Thanks both!
>
>
>
> On Thu, Oct 19, 2023 at 5:56 AM Raghu Angadi 
> wrote:
>
>> +1 overall and a big +1 to keeping offline state-rebalancing as a primary
>> use case.
>>
>> Raghu.
>>
>> On Mon, Oct 16, 2023 at 11:25 AM Bartosz Konieczny <
>> bartkoniec...@gmail.com> wrote:
>>
>>> Thank you, Jungtaek, for your answers! It's clear now.
>>>
>>> +1 for me. It seems like a prerequisite for further ops-related
>>> improvements for the state store management. I mean especially here the
>>> state rebalancing that could rely on this read+write state store API. I
>>> don't mean here the dynamic state rebalancing that could probably be
>>> implemented with a lower latency directly in the stateful API. Instead I'm
>>> thinking more of an offline job to rebalance the state and later restart
>>> the stateful pipeline with the changed number of shuffle partitions.
>>>
>>> Best,
>>> Bartosz.
>>>
>>> On Mon, Oct 16, 2023 at 6:19 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> bump for better reach
>>>>
>>>> On Thu, Oct 12, 2023 at 4:26 PM Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> Sorry, please use this link instead for SPIP doc:
>>>>> https://docs.google.com/document/d/1_iVf_CIu2RZd3yWWF6KoRNlBiz5NbSIK0yThqG0EvPY/edit?usp=sharing
>>>>>
>>>>>
>>>>> On Thu, Oct 12, 2023 at 3:58 PM Jungtaek Lim <
>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> Hi dev,
>>>>>>
>>>>>> I'd like to start a discussion on "State Data Source - Reader".
>>>>>>
>>>>>> This proposal aims to introduce a new data source "statestore" which
>>>>>> enables reading the state rows from existing checkpoint via offline (batch)
>>>>>> query. This will enable users to 1) create unit tests against stateful
>>>>>> query verifying the state value (especially flatMapGroupsWithState), 2)
>>>>>> gather more context

Re: [DISCUSS] SPIP: State Data Source - Reader

2023-10-18 Thread Jungtaek Lim
Thanks Raghu for your support!

I'd also like to relay the support from the JIRA ticket itself: I see
support from Chaoqin and Praveen. Thanks both!



On Thu, Oct 19, 2023 at 5:56 AM Raghu Angadi 
wrote:

> +1 overall and a big +1 to keeping offline state-rebalancing as a primary
> use case.
>
> Raghu.
>
> On Mon, Oct 16, 2023 at 11:25 AM Bartosz Konieczny <
> bartkoniec...@gmail.com> wrote:
>
>> Thank you, Jungtaek, for your answers! It's clear now.
>>
>> +1 for me. It seems like a prerequisite for further ops-related
>> improvements for the state store management. I mean especially here the
>> state rebalancing that could rely on this read+write state store API. I
>> don't mean here the dynamic state rebalancing that could probably be
>> implemented with a lower latency directly in the stateful API. Instead I'm
>> thinking more of an offline job to rebalance the state and later restart
>> the stateful pipeline with the changed number of shuffle partitions.
>>
>> Best,
>> Bartosz.
>>
>> On Mon, Oct 16, 2023 at 6:19 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> bump for better reach
>>>
>>> On Thu, Oct 12, 2023 at 4:26 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> Sorry, please use this link instead for SPIP doc:
>>>> https://docs.google.com/document/d/1_iVf_CIu2RZd3yWWF6KoRNlBiz5NbSIK0yThqG0EvPY/edit?usp=sharing
>>>>
>>>>
>>>> On Thu, Oct 12, 2023 at 3:58 PM Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> Hi dev,
>>>>>
>>>>> I'd like to start a discussion on "State Data Source - Reader".
>>>>>
>>>>> This proposal aims to introduce a new data source "statestore" which
>>>>> enables reading the state rows from existing checkpoint via offline (batch)
>>>>> query. This will enable users to 1) create unit tests against stateful
>>>>> query verifying the state value (especially flatMapGroupsWithState), 2)
>>>>> gather more context on the status when an incident occurs, especially for
>>>>> incorrect output.
>>>>>
>>>>> *SPIP*:
>>>>> https://docs.google.com/document/d/1HjEupRv8TRFeULtJuxRq_tEG1Wq-9UNu-ctGgCYRke0/edit?usp=sharing
>>>>> *JIRA*: https://issues.apache.org/jira/browse/SPARK-45511
>>>>>
>>>>> Looking forward to your feedback!
>>>>>
>>>>> Thanks,
>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>
>>>>> ps. The scope of the project is narrowed to the reader in this SPIP,
>>>>> since the writer requires us to consider more cases. We are planning on it.
>>>>>

>>
>> --
>> Bartosz Konieczny
>> freelance data engineer
>> https://www.waitingforcode.com
>> https://github.com/bartosz25/
>> https://twitter.com/waitingforcode
>>
>>


Re: [DISCUSS] SPIP: State Data Source - Reader

2023-10-18 Thread Raghu Angadi
+1 overall and a big +1 to keeping offline state-rebalancing as a primary
use case.

Raghu.

On Mon, Oct 16, 2023 at 11:25 AM Bartosz Konieczny 
wrote:

> Thank you, Jungtaek, for your answers! It's clear now.
>
> +1 for me. It seems like a prerequisite for further ops-related
> improvements for the state store management. I mean especially here the
> state rebalancing that could rely on this read+write state store API. I
> don't mean here the dynamic state rebalancing that could probably be
> implemented with a lower latency directly in the stateful API. Instead I'm
> thinking more of an offline job to rebalance the state and later restart
> the stateful pipeline with the changed number of shuffle partitions.
>
> Best,
> Bartosz.
>
> On Mon, Oct 16, 2023 at 6:19 PM Jungtaek Lim 
> wrote:
>
>> bump for better reach
>>
>> On Thu, Oct 12, 2023 at 4:26 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Sorry, please use this link instead for SPIP doc:
>>> https://docs.google.com/document/d/1_iVf_CIu2RZd3yWWF6KoRNlBiz5NbSIK0yThqG0EvPY/edit?usp=sharing
>>>
>>>
>>> On Thu, Oct 12, 2023 at 3:58 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> Hi dev,
>>>>
>>>> I'd like to start a discussion on "State Data Source - Reader".
>>>>
>>>> This proposal aims to introduce a new data source "statestore" which
>>>> enables reading the state rows from existing checkpoint via offline (batch)
>>>> query. This will enable users to 1) create unit tests against stateful
>>>> query verifying the state value (especially flatMapGroupsWithState), 2)
>>>> gather more context on the status when an incident occurs, especially for
>>>> incorrect output.
>>>>
>>>> *SPIP*:
>>>> https://docs.google.com/document/d/1HjEupRv8TRFeULtJuxRq_tEG1Wq-9UNu-ctGgCYRke0/edit?usp=sharing
>>>> *JIRA*: https://issues.apache.org/jira/browse/SPARK-45511
>>>>
>>>> Looking forward to your feedback!
>>>>
>>>> Thanks,
>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>> ps. The scope of the project is narrowed to the reader in this SPIP,
>>>> since the writer requires us to consider more cases. We are planning on it.

>>>
>
> --
> Bartosz Konieczny
> freelance data engineer
> https://www.waitingforcode.com
> https://github.com/bartosz25/
> https://twitter.com/waitingforcode
>
>


unsubscribe

2023-10-18 Thread Duy Pham
unsubscribe


Re: [DISCUSS] SPIP: State Data Source - Reader

2023-10-18 Thread Jungtaek Lim
Thanks Yuanjian for your support!

I've left a comment, but to repeat it here: I agree with your point. It's
really not easy for a new feature to be stable from its initial version, and
we might want to allow breaking backward compatibility for (semantic) bug
fixes/improvements. Maybe we could mark the data source as
incubating/experimental and wait a couple of minor releases to see
whether the options/behaviors can be finalized.

On Wed, Oct 18, 2023 at 4:24 PM Yuanjian Li  wrote:

> +1, I have no issues with the practicality and value of this feature
> itself.
> I've left some comments concerning ongoing maintenance and
> compatibility-related matters, which we can continue to discuss.
>
> On Tue, Oct 17, 2023 at 05:23, Jungtaek Lim wrote:
>
>> Thanks Bartosz and Anish for your support!
>>
>> I'll wait for a couple more days to see whether we can hear more voices
>> on this. We could probably look for initiating a VOTE thread if there is no
>> objection.
>>
>> On Tue, Oct 17, 2023 at 5:48 AM Anish Shrigondekar <
>> anish.shrigonde...@databricks.com> wrote:
>>
>>> Hi Jungtaek,
>>>
>>> Thanks for putting this together. +1 from me and looks good overall.
>>> Posted some minor comments/questions to the doc.
>>>
>>> Thanks,
>>> Anish
>>>
>>> On Mon, Oct 16, 2023 at 11:25 AM Bartosz Konieczny <
>>> bartkoniec...@gmail.com> wrote:
>>>
>>>> Thank you, Jungtaek, for your answers! It's clear now.
>>>>
>>>> +1 for me. It seems like a prerequisite for further ops-related
>>>> improvements for the state store management. I mean especially here the
>>>> state rebalancing that could rely on this read+write state store API. I
>>>> don't mean here the dynamic state rebalancing that could probably be
>>>> implemented with a lower latency directly in the stateful API. Instead I'm
>>>> thinking more of an offline job to rebalance the state and later restart
>>>> the stateful pipeline with the changed number of shuffle partitions.
>>>>
>>>> Best,
>>>> Bartosz.
>>>>
>>>> On Mon, Oct 16, 2023 at 6:19 PM Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> bump for better reach
>>>>>
>>>>> On Thu, Oct 12, 2023 at 4:26 PM Jungtaek Lim <
>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> Sorry, please use this link instead for SPIP doc:
>>>>>> https://docs.google.com/document/d/1_iVf_CIu2RZd3yWWF6KoRNlBiz5NbSIK0yThqG0EvPY/edit?usp=sharing
>>>>>>
>>>>>>
>>>>>> On Thu, Oct 12, 2023 at 3:58 PM Jungtaek Lim <
>>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi dev,
>>>>>>>
>>>>>>> I'd like to start a discussion on "State Data Source - Reader".
>>>>>>>
>>>>>>> This proposal aims to introduce a new data source "statestore" which
>>>>>>> enables reading the state rows from existing checkpoint via offline (batch)
>>>>>>> query. This will enable users to 1) create unit tests against stateful
>>>>>>> query verifying the state value (especially flatMapGroupsWithState), 2)
>>>>>>> gather more context on the status when an incident occurs, especially for
>>>>>>> incorrect output.
>>>>>>>
>>>>>>> *SPIP*:
>>>>>>> https://docs.google.com/document/d/1HjEupRv8TRFeULtJuxRq_tEG1Wq-9UNu-ctGgCYRke0/edit?usp=sharing
>>>>>>> *JIRA*: https://issues.apache.org/jira/browse/SPARK-45511
>>>>>>>
>>>>>>> Looking forward to your feedback!
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>
>>>>>>> ps. The scope of the project is narrowed to the reader in this SPIP,
>>>>>>> since the writer requires us to consider more cases. We are planning on it.
>>>>>>>
>>>>>>
>>>>
>>>> --
>>>> Bartosz Konieczny
>>>> freelance data engineer
>>>> https://www.waitingforcode.com
>>>> https://github.com/bartosz25/
>>>> https://twitter.com/waitingforcode




Re: [DISCUSS] SPIP: State Data Source - Reader

2023-10-18 Thread Yuanjian Li
+1, I have no issues with the practicality and value of this feature itself.
I've left some comments concerning ongoing maintenance and
compatibility-related matters, which we can continue to discuss.

On Tue, Oct 17, 2023 at 05:23, Jungtaek Lim wrote:

> Thanks Bartosz and Anish for your support!
>
> I'll wait for a couple more days to see whether we can hear more voices on
> this. We could probably look for initiating a VOTE thread if there is no
> objection.
>
> On Tue, Oct 17, 2023 at 5:48 AM Anish Shrigondekar <
> anish.shrigonde...@databricks.com> wrote:
>
>> Hi Jungtaek,
>>
>> Thanks for putting this together. +1 from me and looks good overall.
>> Posted some minor comments/questions to the doc.
>>
>> Thanks,
>> Anish
>>
>> On Mon, Oct 16, 2023 at 11:25 AM Bartosz Konieczny <
>> bartkoniec...@gmail.com> wrote:
>>
>>> Thank you, Jungtaek, for your answers! It's clear now.
>>>
>>> +1 for me. It seems like a prerequisite for further ops-related
>>> improvements for the state store management. I mean especially here the
>>> state rebalancing that could rely on this read+write state store API. I
>>> don't mean here the dynamic state rebalancing that could probably be
>>> implemented with a lower latency directly in the stateful API. Instead I'm
>>> thinking more of an offline job to rebalance the state and later restart
>>> the stateful pipeline with the changed number of shuffle partitions.
>>>
>>> Best,
>>> Bartosz.
>>>
>>> On Mon, Oct 16, 2023 at 6:19 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> bump for better reach
>>>>
>>>> On Thu, Oct 12, 2023 at 4:26 PM Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> Sorry, please use this link instead for SPIP doc:
>>>>> https://docs.google.com/document/d/1_iVf_CIu2RZd3yWWF6KoRNlBiz5NbSIK0yThqG0EvPY/edit?usp=sharing
>>>>>
>>>>>
>>>>> On Thu, Oct 12, 2023 at 3:58 PM Jungtaek Lim <
>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> Hi dev,
>>>>>>
>>>>>> I'd like to start a discussion on "State Data Source - Reader".
>>>>>>
>>>>>> This proposal aims to introduce a new data source "statestore" which
>>>>>> enables reading the state rows from existing checkpoint via offline (batch)
>>>>>> query. This will enable users to 1) create unit tests against stateful
>>>>>> query verifying the state value (especially flatMapGroupsWithState), 2)
>>>>>> gather more context on the status when an incident occurs, especially for
>>>>>> incorrect output.
>>>>>>
>>>>>> *SPIP*:
>>>>>> https://docs.google.com/document/d/1HjEupRv8TRFeULtJuxRq_tEG1Wq-9UNu-ctGgCYRke0/edit?usp=sharing
>>>>>> *JIRA*: https://issues.apache.org/jira/browse/SPARK-45511
>>>>>>
>>>>>> Looking forward to your feedback!
>>>>>>
>>>>>> Thanks,
>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>
>>>>>> ps. The scope of the project is narrowed to the reader in this SPIP,
>>>>>> since the writer requires us to consider more cases. We are planning on it.
>>>>>>
>>>>>
>>>
>>> --
>>> Bartosz Konieczny
>>> freelance data engineer
>>> https://www.waitingforcode.com
>>> https://github.com/bartosz25/
>>> https://twitter.com/waitingforcode
>>>
>>>