Re: [VOTE] SPIP: State Data Source - Reader

2023-10-22 Thread Jungtaek Lim
Starting with my +1 (non-binding). Thanks!

On Mon, Oct 23, 2023 at 1:23 PM Jungtaek Lim 
wrote:

> Hi all,
>
> I'd like to start the vote for SPIP: State Data Source - Reader.
>
> In short, the SPIP proposes a new data source that makes the state store
> in a checkpoint readable via batch query. This would enable two major use
> cases: 1) constructing tests that verify the state store, and 2) inspecting
> values in the state store during an incident.
>
> References:
>
>    - JIRA ticket
>    - SPIP doc
>    - Discussion thread
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Thanks!
> Jungtaek Lim (HeartSaVioR)
>


[VOTE] SPIP: State Data Source - Reader

2023-10-22 Thread Jungtaek Lim
Hi all,

I'd like to start the vote for SPIP: State Data Source - Reader.

In short, the SPIP proposes a new data source that makes the state store
in a checkpoint readable via batch query. This would enable two major use
cases: 1) constructing tests that verify the state store, and 2) inspecting
values in the state store during an incident.

References:

   - JIRA ticket
   - SPIP doc
   - Discussion thread

Please vote on the SPIP for the next 72 hours:

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …

Thanks!
Jungtaek Lim (HeartSaVioR)


Re: [DISCUSS] SPIP: State Data Source - Reader

2023-10-22 Thread Jungtaek Lim
I don't see major comments as of now. Given that the thread was initiated
more than 10 days ago and I see multiple supporters, I'm going to initiate
a VOTE thread.

Please participate in the VOTE thread as well. Thanks!

On Thu, Oct 19, 2023 at 11:39 AM Jungtaek Lim 
wrote:

> Also, I want to repeat here the comment Liang-Chi left in the SPIP doc, as
> it is a general question that comes up for every new data source. Hence I
> want to address it for everyone.
>
>> As I know, the author implemented a third-party tool for querying the
>> state store as a data source a long time ago. I've suggested that some
>> users use the tool before. It is a useful tool for special cases because
>> there is no other tool/feature for the purpose.
>> I think for such an effort to add a new data source, one usual question is
>> why it has to be in the Spark repo instead of being a third-party tool,
>> especially since this is not a frequently used one. Even for Structured
>> Streaming users, it is only rarely necessary to look into state store
>> content.
>
>
> I don't think we expect the data source to be used rarely. We see two
> major use cases: 1) unit tests against a stateful query, and 2) looking
> into the state during an incident to get full context. 2) is probably not
> something users encounter frequently, hence it is valid to say that use
> case may not come up often. But 1) is definitely tied to daily work.
>
> Also, even for 2), it looks to be an essential feature that has to be
> provided out of the box. Let's say this feature does not exist and a user
> encounters an incident in production with a stateful query. During RCA,
> they realize that state is a black box and their only option is to deduce
> the value of the state indirectly, most likely requiring them to modify
> the query heavily and feed it artificial inputs. If I were such a user, I
> would consider this gap a fundamental issue of SS. It has been available
> out of the box in Flink for years (State Processor), so it also matters
> competitively.
>
> We see this effort as a stepping stone. As the comments in the SPIP doc
> and the previous replies show, people also see the proposal as groundwork
> for the writer part, which would give us a chance to break the strong
> assumption of a fixed number of shuffle partitions. I'd argue that this is
> a rather fundamental limitation of SS, and I have seen many complaints
> about it. I don't feel it is right to delegate solving such a fundamental
> issue to a 3rd party. This is probably a stronger argument than the
> reader part itself.
>
> Here's another aspect: during the work, we noticed gaps in checkpointing,
> e.g. the information about prefix scan does not exist in the checkpoint,
> which makes a big difference when restoring the state from the state
> file. When we come to state repartitioning, the repartitioning is based on
> the grouping keys in the operator (not the state key), hence we will also
> need additional information for that. If this feature goes into a 3rd
> party, it will be very painful to make changes on both sides together. It
> also brings another headache: versioning and a compatibility matrix.
>
> I hope this helps persuade people to add this to the Spark repo rather
> than having it live on its own.
>
>
> On Thu, Oct 19, 2023 at 11:08 AM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> Thanks Raghu for your support!
>>
>> Btw, I'd also like to relay the support from the JIRA ticket itself: I
>> see support from Chaoqin and Praveen. Thanks both!
>>
>>
>>
>> On Thu, Oct 19, 2023 at 5:56 AM Raghu Angadi 
>> wrote:
>>
>>> +1 overall and a big +1 to keeping offline state-rebalancing as a
>>> primary use case.
>>>
>>> Raghu.
>>>
>>> On Mon, Oct 16, 2023 at 11:25 AM Bartosz Konieczny <
>>> bartkoniec...@gmail.com> wrote:
>>>
>>>> Thank you, Jungtaek, for your answers! It's clear now.
>>>>
>>>> +1 for me. It seems like a prerequisite for further ops-related
>>>> improvements for the state store management. I mean especially here the
>>>> state rebalancing that could rely on this read+write state store API. I
>>>> don't mean here the dynamic state rebalancing that could probably be
>>>> implemented with a lower latency directly in the stateful API. Instead I'm
>>>> thinking more of an offline job to rebalance the state and later restart
>>>> the stateful pipeline with the changed number of shuffle partitions.
>>>>
>>>> Best,
>>>> Bartosz.
>>>>
>>>> On Mon, Oct 16, 2023 at 6:19 PM Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> bump for better reach
>>>>>
>>>>> On Thu, Oct 12, 2023 at 4:26 PM Jungtaek Lim <
>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> Sorry, please use this link instead for SPIP doc:
>>>>>> https://docs.google.com/document/d/1_iVf_CIu2RZd3yWWF6KoRNlBiz5NbSIK0yThqG0EvPY/edit?usp=sharing
>>>>>>
>>>>>>
>>>>>> On Thu, Oct 12, 2023 at 3:58 PM Jungtaek Lim <
>>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>>