Re: [DISCUSS] SPIP: State Data Source - Reader
+1

Sent from my iPhone
Re: [DISCUSS] SPIP: State Data Source - Reader
FYI: VOTE thread is open, please check the link
https://lists.apache.org/thread/7ohctj1gmqbhds56bntf4s2zst5qpll1 (committer+ can login to reply) or search with "[VOTE] SPIP: State Data Source - Reader" in your inbox. Every vote would be really appreciated!
Re: [DISCUSS] SPIP: State Data Source - Reader
I don't see major comments as of now. Given that the thread was initiated more than 10 days ago and I see multiple supporters, I'm going to initiate a VOTE thread.

Please participate in the VOTE thread as well. Thanks!
Re: [DISCUSS] SPIP: State Data Source - Reader
Also, I want to replicate the comment Liang-Chi put into the SPIP doc, as it is a rather general and common question for every new data source. Hence I want to sort it out for everyone.

> As far as I know, the author implemented a third-party tool for querying the state store as a data source a long time ago. I've suggested that some users use the tool before. It is a useful tool for special cases because there is no other tool/feature for the purpose.
> I think for such an effort to add a new data source, one usual question is why it has to be in the Spark repo instead of being a third-party tool, especially since this is not a frequently used one. Even for Structured Streaming users, it is only necessary in rare cases to look into the state store content.

I don't think we expect the data source to be used rarely. We see two major use cases: 1) unit tests against a stateful query, and 2) looking into the state during an incident to get the full context. 2) is probably not something users encounter frequently, so it is valid to say that this part of the feature may not be used often. But 1) is definitely tied to daily work.

Also, even for 2), it looks to be an essential feature that has to be provided out of the box. Say this feature does not exist and a user encounters an incident in production with a stateful query. During RCA, they realize that the state is a black box and their only option is to deduce the value of the state indirectly, most likely requiring them to modify the query heavily and feed it artificial inputs. If I were such a user, I would consider this gap a fundamental issue of SS. It has been available out of the box in Flink for years (State Processor), so it also makes sense from a competitive standpoint.

We see this effort as a stepping stone. As the comments in the SPIP doc and the previous replies show, people also see the proposal as groundwork for the writer part, which would give us a chance to break the strong preconception of a fixed number of shuffle partitions. I'd argue that this is a rather fundamental limitation of SS, and I have seen so many complaints about it. I don't feel it is right to delegate to a 3rd party to solve a fundamental issue. This is probably stronger evidence than the reader part.

Here's another aspect: during the work, we observed missing pieces in checkpointing, e.g. the information about prefix scan does not exist in the checkpoint, which makes a big difference when restoring the state from the state file. When we come to state repartitioning, the repartition is based on the grouping keys in the operator (not the state key), hence we will also need additional information for that. If this feature goes into a 3rd party, it will be very painful to make both sides of the changes together. It also brings up another headache: versioning and a compatibility matrix.

I hope this helps persuade people to add this to the Spark repo rather than letting it live on its own.
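A minimal sketch of the repartitioning point in this message, in plain Python rather than Spark. It assumes a state row whose state key carries an extra field (a window start) on top of the grouping key, and shows why the grouping key has to be known to place a row in its new partition. The row layout, the function names and the use of crc32 as the hash are all assumptions made for illustration; none of them comes from Spark or from the SPIP.

```python
import zlib
from collections import defaultdict


def partition_for(grouping_key, num_partitions):
    """Map a grouping key to a partition id.

    crc32 is a deterministic stand-in chosen for this sketch only;
    it is not the hash function Spark uses.
    """
    return zlib.crc32(grouping_key.encode("utf-8")) % num_partitions


def repartition_state(rows, num_partitions):
    """Regroup state rows into `num_partitions` buckets by grouping key."""
    partitions = defaultdict(list)
    for row in rows:
        partition_id = partition_for(row["grouping_key"], num_partitions)
        partitions[partition_id].append(row)
    return dict(partitions)


# Hypothetical state rows: the state key is (grouping key, window start),
# so hashing the state key would scatter one user's windows across partitions.
state_rows = [
    {"grouping_key": "user-a", "state_key": ("user-a", "2023-10-12T00:00"), "value": 3},
    {"grouping_key": "user-a", "state_key": ("user-a", "2023-10-12T01:00"), "value": 5},
    {"grouping_key": "user-b", "state_key": ("user-b", "2023-10-12T00:00"), "value": 1},
    {"grouping_key": "user-c", "state_key": ("user-c", "2023-10-12T00:00"), "value": 7},
]

# Offline "rebalance" to a new partition count.
rebalanced = repartition_state(state_rows, num_partitions=8)
for partition_id in sorted(rebalanced):
    print(partition_id, [row["state_key"] for row in rebalanced[partition_id]])
```

Every row for the same grouping key lands in the same bucket regardless of the rest of its state key, which is the property a restarted query with a different number of shuffle partitions would rely on.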
Re: [DISCUSS] SPIP: State Data Source - Reader
Thanks Raghu for your support!

Btw, I'd like to replicate the support from JIRA ticket itself, I see support from Chaoqin and Praveen. Thanks both!
Re: [DISCUSS] SPIP: State Data Source - Reader
+1 overall and a big +1 to keeping offline state-rebalancing as a primary use case.

Raghu.
Re: [DISCUSS] SPIP: State Data Source - Reader
Thanks Yuanjian for your support!

I've left a comment, but to replicate it here: I agree with your point. It's really not easy for a new feature to be stable from the initial version, and we might want to decide on breaking backward compatibility for (semantic) bug fixes/improvements. Maybe we could mark the data source as incubating/experimental and watch it over a couple of minor releases to see whether the options/behaviors can be finalized.
Re: [DISCUSS] SPIP: State Data Source - Reader
+1, I have no issues with the practicality and value of this feature itself.
I've left some comments concerning ongoing maintenance and compatibility-related matters, which we can continue to discuss.
Re: [DISCUSS] SPIP: State Data Source - Reader
Thanks Bartosz and Anish for your support!

I'll wait for a couple more days to see whether we can hear more voices on this. We could probably look for initiating a VOTE thread if there is no objection.
Re: [DISCUSS] SPIP: State Data Source - Reader
Hi Jungtaek,

Thanks for putting this together. +1 from me and looks good overall.
Posted some minor comments/questions to the doc.

Thanks,
Anish
Re: [DISCUSS] SPIP: State Data Source - Reader
Thank you, Jungtaek, for your answers! It's clear now.

+1 for me. It seems like a prerequisite for further ops-related improvements for the state store management. I especially mean state rebalancing, which could rely on this read+write state store API. I don't mean dynamic state rebalancing, which could probably be implemented with lower latency directly in the stateful API. Instead, I'm thinking more of an offline job to rebalance the state and later restart the stateful pipeline with the changed number of shuffle partitions.

Best,
Bartosz.

On Mon, Oct 16, 2023 at 6:19 PM Jungtaek Lim wrote:
> bump for better reach

--
Bartosz Konieczny
freelance data engineer
https://www.waitingforcode.com
https://github.com/bartosz25/
https://twitter.com/waitingforcode
Re: [DISCUSS] SPIP: State Data Source - Reader
bump for better reach

On Thu, Oct 12, 2023 at 4:26 PM Jungtaek Lim wrote:
> Sorry, please use this link instead for SPIP doc:
> https://docs.google.com/document/d/1_iVf_CIu2RZd3yWWF6KoRNlBiz5NbSIK0yThqG0EvPY/edit?usp=sharing
Re: [DISCUSS] SPIP: State Data Source - Reader
Sorry, please use this link instead for SPIP doc:
https://docs.google.com/document/d/1_iVf_CIu2RZd3yWWF6KoRNlBiz5NbSIK0yThqG0EvPY/edit?usp=sharing
[DISCUSS] SPIP: State Data Source - Reader
Hi dev,

I'd like to start a discussion on "State Data Source - Reader".

This proposal aims to introduce a new data source, "statestore", which enables reading the state rows from an existing checkpoint via an offline (batch) query. This will enable users to 1) create unit tests against a stateful query that verify the state value (especially for flatMapGroupsWithState), and 2) gather more context on the status when an incident occurs, especially for incorrect output.

*SPIP*: https://docs.google.com/document/d/1HjEupRv8TRFeULtJuxRq_tEG1Wq-9UNu-ctGgCYRke0/edit?usp=sharing
*JIRA*: https://issues.apache.org/jira/browse/SPARK-45511

Looking forward to your feedback!

Thanks,
Jungtaek Lim (HeartSaVioR)

P.S. The scope of the project is narrowed to the reader in this SPIP, since the writer requires us to consider more cases. We are planning on it.
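
To make the proposed usage concrete, here is a minimal PySpark sketch of what an offline read of state rows might look like. Only the format name "statestore" comes from the proposal above. The option names (`operatorId`, `batchId`, `storeName`) and the helper functions are assumptions for illustration, not something this thread specifies, and the final API may differ.

```python
from typing import Any, Dict, Optional


def state_reader_options(operator_id: int = 0,
                         batch_id: Optional[int] = None,
                         store_name: Optional[str] = None) -> Dict[str, str]:
    """Build reader options. The option names are assumptions, not from the thread."""
    opts = {"operatorId": str(operator_id)}
    if batch_id is not None:
        opts["batchId"] = str(batch_id)
    if store_name is not None:
        opts["storeName"] = store_name
    return opts


def read_state(spark: Any, checkpoint_location: str, **kwargs: Any):
    """Read state rows from an existing checkpoint as a batch DataFrame.

    `spark` is expected to be a SparkSession; "statestore" is the data source
    name given in the proposal.
    """
    return (
        spark.read.format("statestore")
        .options(**state_reader_options(**kwargs))
        .load(checkpoint_location)
    )
```

In a unit test, one could run the stateful query to completion against a temporary checkpoint directory and then call something like `read_state(spark, "/tmp/checkpoint")` (a hypothetical path) to assert on the resulting state rows.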