Re: [VOTE] SPIP: State Data Source - Reader

2023-10-23 Thread L. C. Hsieh
+1

On Mon, Oct 23, 2023 at 6:31 PM Anish Shrigondekar
 wrote:
>
> +1 (non-binding)
>
> Thanks,
> Anish
>
> On Mon, Oct 23, 2023 at 5:01 PM Wenchen Fan  wrote:
>>
>> +1
>>
>> On Mon, Oct 23, 2023 at 4:03 PM Jungtaek Lim  
>> wrote:
>>>
>>> Starting with my +1 (non-binding). Thanks!
>>>
>>> On Mon, Oct 23, 2023 at 1:23 PM Jungtaek Lim  
>>> wrote:

>>>> Hi all,
>>>>
>>>> I'd like to start the vote for SPIP: State Data Source - Reader.
>>>>
>>>> The high-level summary of the SPIP is that we propose a new data source
>>>> which enables reading the state store in the checkpoint via a batch
>>>> query. This would enable two major use cases: 1) constructing tests that
>>>> verify the state store, and 2) inspecting values in the state store
>>>> during an incident.
>>>>
>>>> References:
>>>>
>>>> - JIRA ticket
>>>> - SPIP doc
>>>> - Discussion thread
>>>>
>>>> Please vote on the SPIP for the next 72 hours:
>>>>
>>>> [ ] +1: Accept the proposal as an official SPIP
>>>> [ ] +0
>>>> [ ] -1: I don’t think this is a good idea because …
>>>>
>>>> Thanks!
>>>> Jungtaek Lim (HeartSaVioR)




Re: [VOTE] SPIP: State Data Source - Reader

2023-10-23 Thread Anish Shrigondekar
+1 (non-binding)

Thanks,
Anish

On Mon, Oct 23, 2023 at 5:01 PM Wenchen Fan  wrote:

> +1
>
> On Mon, Oct 23, 2023 at 4:03 PM Jungtaek Lim 
> wrote:
>
>> Starting with my +1 (non-binding). Thanks!
>>
>> On Mon, Oct 23, 2023 at 1:23 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I'd like to start the vote for SPIP: State Data Source - Reader.
>>>
>>> The high-level summary of the SPIP is that we propose a new data source
>>> which enables reading the state store in the checkpoint via a batch
>>> query. This would enable two major use cases: 1) constructing tests that
>>> verify the state store, and 2) inspecting values in the state store
>>> during an incident.
>>>
>>> References:
>>>
>>> - JIRA ticket
>>> - SPIP doc
>>> - Discussion thread
>>>
>>> Please vote on the SPIP for the next 72 hours:
>>>
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>>
>>> Thanks!
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>


Re: [DISCUSS] SPIP: State Data Source - Reader

2023-10-23 Thread laglangyue
+1

Sent from my iPhone


-- Original --
From: Jungtaek Lim
FYI: The VOTE thread is open. Please check the link
https://lists.apache.org/thread/7ohctj1gmqbhds56bntf4s2zst5qpll1
(committer+ can log in to reply) or search for "[VOTE] SPIP: State Data
Source - Reader" in your inbox. Every vote would be really appreciated!

On Mon, Oct 23, 2023 at 1:06 PM Jungtaek Lim wrote:
https://docs.google.com/document/d/1_iVf_CIu2RZd3yWWF6KoRNlBiz5NbSIK0yThqG0EvPY/edit?usp=sharing

On Thu, Oct 12, 2023 at 3:58 PM Jungtaek Lim wrote:
https://docs.google.com/document/d/1HjEupRv8TRFeULtJuxRq_tEG1Wq-9UNu-ctGgCYRke0/edit?usp=sharing
JIRA: https://issues.apache.org/jira/browse/SPARK-45511



Looking forward to your feedback!


Thanks,
Jungtaek Lim (HeartSaVioR)



ps. The scope of the project is narrowed to the reader in this SPIP, since the
writer requires us to consider more cases. We are planning to work on it.

-- 
Bartosz Konieczny
freelance data engineer
https://www.waitingforcode.com
https://github.com/bartosz25/
https://twitter.com/waitingforcode

Re: [VOTE] SPIP: State Data Source - Reader

2023-10-23 Thread Wenchen Fan
+1

On Mon, Oct 23, 2023 at 4:03 PM Jungtaek Lim 
wrote:

> Starting with my +1 (non-binding). Thanks!
>
> On Mon, Oct 23, 2023 at 1:23 PM Jungtaek Lim 
> wrote:
>
>> Hi all,
>>
>> I'd like to start the vote for SPIP: State Data Source - Reader.
>>
>> The high-level summary of the SPIP is that we propose a new data source
>> which enables reading the state store in the checkpoint via a batch
>> query. This would enable two major use cases: 1) constructing tests that
>> verify the state store, and 2) inspecting values in the state store
>> during an incident.
>>
>> References:
>>
>> - JIRA ticket
>> - SPIP doc
>> - Discussion thread
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>> Thanks!
>> Jungtaek Lim (HeartSaVioR)
>>
>


Re: [DISCUSS] SPIP: State Data Source - Reader

2023-10-23 Thread Jungtaek Lim
FYI: The VOTE thread is open. Please check the link
https://lists.apache.org/thread/7ohctj1gmqbhds56bntf4s2zst5qpll1
(committer+ can log in to reply) or search for "[VOTE] SPIP: State Data
Source - Reader" in your inbox. Every vote would be really appreciated!

On Mon, Oct 23, 2023 at 1:06 PM Jungtaek Lim 
wrote:

> I don't see major comments as of now. Given that the thread was initiated
> more than 10 days ago and I see multiple supporters, I'm going to initiate
> a VOTE thread.
>
> Please participate in the VOTE thread as well. Thanks!
>
> On Thu, Oct 19, 2023 at 11:39 AM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> Also, I want to replicate the comment Liang-Chi put into the SPIP doc, as
>> it is a rather general and usual question for every new data source. Hence
>> I want to sort it out for everyone.
>>
>>> As I know, the author implemented a third-party tool for querying the
>>> state store as a data source a long time ago. I've suggested that some
>>> users use the tool before. It is a useful tool for special cases because
>>> there is no other tool/feature for the purpose.
>>> I think for such an effort to add a new data source, one usual question is
>>> why it has to be in the Spark repo instead of being a third-party tool,
>>> especially since this is not a frequently used one. Even for Structured
>>> Streaming users, it is necessary to look into state store content only in
>>> rare cases.
>>
>>
>> I don't think we should expect the data source to be used rarely. We see
>> two major use cases: 1) unit tests against stateful queries, and 2) looking
>> into the state during an incident to get full context. Users will probably
>> not run into 2) that often, so it is valid to say that part of the feature
>> may not be used frequently. But 1) is definitely tied to daily work.
>>
>> Also, even 2) looks to be an essential feature that has to be provided
>> out of the box. Let's say this feature does not exist and a user
>> encounters an incident in production with a stateful query. During RCA,
>> they realize that the state is a black box and their only option is to
>> deduce the value of the state indirectly, most likely requiring them to
>> modify the query heavily and feed it artificial inputs. If I were such a
>> user, I would consider this gap a fundamental issue of SS. It has been
>> available out of the box in Flink for years (State Processor), so it also
>> makes sense from a competitive standpoint.
>>
>> We see this effort as a stepping stone. As the comments in the SPIP doc
>> and the previous replies show, people also see the proposal as groundwork
>> for the writer part, which would give us a chance to break the strong
>> assumption of a fixed number of shuffle partitions. I'd argue that this
>> is a rather fundamental limitation of SS, and I have seen so many
>> complaints about it. I don't feel it is right to delegate solving such a
>> fundamental issue to a 3rd party. This is probably a stronger argument
>> than the reader part.
>>
>> Here's another aspect: during the work, we observed gaps in checkpointing,
>> e.g. the information about prefix scan does not exist in the checkpoint,
>> which makes a big difference when restoring the state from the state file.
>> When it comes to state repartitioning, the repartition is based on the
>> grouping keys in the operator (not the state key), hence we will also need
>> additional information for that. If this feature goes into a 3rd party
>> project, it will be very painful to make the changes on both sides
>> together. It also brings another headache: versioning and a compatibility
>> matrix.
>>
>> I hope this helps persuade people to add this to the Spark repo rather
>> than leaving it to live on its own as a third-party project.
>>
>>
>> On Thu, Oct 19, 2023 at 11:08 AM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Thanks Raghu for your support!
>>>
>>> Btw, I'd like to replicate the support from the JIRA ticket itself: I see
>>> support from Chaoqin and Praveen. Thanks both!
>>>
>>>
>>>
>>> On Thu, Oct 19, 2023 at 5:56 AM Raghu Angadi <
>>> raghu.ang...@databricks.com> wrote:
>>>
>>>> +1 overall and a big +1 to keeping offline state-rebalancing as a
>>>> primary use case.
>>>>
>>>> Raghu.
>>>>
>>>> On Mon, Oct 16, 2023 at 11:25 AM Bartosz Konieczny <
>>>> bartkoniec...@gmail.com> wrote:
>>>>
>>>>> Thank you, Jungtaek, for your answers! It's clear now.
>>>>>
>>>>> +1 for me. It seems like a prerequisite for further ops-related
>>>>> improvements for the state store management. I mean especially here the
>>>>> state rebalancing that could rely on this read+write state store API. I
>>>>> don't mean here the dynamic state rebalancing that could probably be
>>>>> implemented with a lower latency directly in the stateful API. Instead I'm
>>>>> thinking more of an offline job to rebalance the state and later restart
>>>>> the stateful pipeline with the changed number of shuffle partitions.
>>>>>
>>>>> Best,
>>>>> Bartosz.
>>>>>
>>>>> On Mon, Oct 16, 2023 at 6:19 PM