Re: [DISCUSS] SPIP: State Data Source - Reader
Thanks, Bartosz and Anish, for your support!

I'll wait a couple more days to see whether we hear more voices on this. If there are no objections, we can probably start a VOTE thread.

On Tue, Oct 17, 2023 at 5:48 AM Anish Shrigondekar <anish.shrigonde...@databricks.com> wrote:

> Hi Jungtaek,
>
> Thanks for putting this together. +1 from me and looks good overall.
> Posted some minor comments/questions to the doc.
>
> Thanks,
> Anish
>
> On Mon, Oct 16, 2023 at 11:25 AM Bartosz Konieczny <bartkoniec...@gmail.com> wrote:
>
>> Thank you, Jungtaek, for your answers! It's clear now.
>>
>> +1 for me. It seems like a prerequisite for further ops-related
>> improvements to state store management. I'm thinking especially of
>> state rebalancing, which could rely on this read+write state store API.
>> I don't mean dynamic state rebalancing, which could probably be
>> implemented with lower latency directly in the stateful API. Instead,
>> I'm thinking of an offline job that rebalances the state, after which
>> the stateful pipeline is restarted with the changed number of shuffle
>> partitions.
>>
>> Best,
>> Bartosz.
>>
>> On Mon, Oct 16, 2023 at 6:19 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>
>>> bump for better reach
>>>
>>> On Thu, Oct 12, 2023 at 4:26 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> Sorry, please use this link instead for SPIP doc:
>>>> https://docs.google.com/document/d/1_iVf_CIu2RZd3yWWF6KoRNlBiz5NbSIK0yThqG0EvPY/edit?usp=sharing
>>>>
>>>> On Thu, Oct 12, 2023 at 3:58 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> Hi dev,
>>>>>
>>>>> I'd like to start a discussion on "State Data Source - Reader".
>>>>>
>>>>> This proposal aims to introduce a new data source "statestore" which
>>>>> enables reading the state rows from an existing checkpoint via an
>>>>> offline (batch) query. This will enable users to 1) create unit tests
>>>>> against a stateful query that verify the state value (especially
>>>>> flatMapGroupsWithState), 2) gather more context on the status when an
>>>>> incident occurs, especially for incorrect output.
>>>>>
>>>>> *SPIP*:
>>>>> https://docs.google.com/document/d/1HjEupRv8TRFeULtJuxRq_tEG1Wq-9UNu-ctGgCYRke0/edit?usp=sharing
>>>>> *JIRA*: https://issues.apache.org/jira/browse/SPARK-45511
>>>>>
>>>>> Looking forward to your feedback!
>>>>>
>>>>> Thanks,
>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>
>>>>> ps. The scope of the project is narrowed to the reader in this SPIP,
>>>>> since the writer requires us to consider more cases. We are planning
>>>>> on it.
>>
>> --
>> Bartosz Konieczny
>> freelance data engineer
>> https://www.waitingforcode.com
>> https://github.com/bartosz25/
>> https://twitter.com/waitingforcode
Re: [DISCUSS] SPIP: State Data Source - Reader
Hi Jungtaek,

Thanks for putting this together. +1 from me and looks good overall. Posted some minor comments/questions to the doc.

Thanks,
Anish
Re: [DISCUSS] SPIP: State Data Source - Reader
Thank you, Jungtaek, for your answers! It's clear now.

+1 for me. It seems like a prerequisite for further ops-related improvements to state store management. I'm thinking especially of state rebalancing, which could rely on this read+write state store API. I don't mean dynamic state rebalancing, which could probably be implemented with lower latency directly in the stateful API. Instead, I'm thinking of an offline job that rebalances the state, after which the stateful pipeline is restarted with the changed number of shuffle partitions.

Best,
Bartosz.

--
Bartosz Konieczny
freelance data engineer
https://www.waitingforcode.com
https://github.com/bartosz25/
https://twitter.com/waitingforcode
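For illustration only, the offline rebalancing idea in this message could look roughly like the sketch below. It is not part of the SPIP: the in-memory dictionary stands in for state rows that a reader and a (not yet proposed) writer would handle, the names `partition_for` and `rebalance` are hypothetical, and CRC32 is used only as a deterministic stand-in for Spark's own hash partitioning.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    # Deterministic stand-in hash; Spark's real partitioner does not use CRC32.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def rebalance(state_by_partition, new_num_partitions):
    """Redistribute (key, value) state rows across a new partition count."""
    rebalanced = {p: [] for p in range(new_num_partitions)}
    for rows in state_by_partition.values():
        for key, value in rows:
            target = partition_for(key, new_num_partitions)
            rebalanced[target].append((key, value))
    return rebalanced


# Build a toy state laid out for 4 partitions, then move it to 6.
old_state = {p: [] for p in range(4)}
for i in range(20):
    k = f"user-{i}"
    old_state[partition_for(k, 4)].append((k, i))

new_state = rebalance(old_state, 6)
```

After such a job, every key sits in the partition that the new partition count assigns to it, which is the condition the restarted pipeline would need in order to find its state.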
Re: [DISCUSS] SPIP: State Data Source - Reader
bump for better reach

On Thu, Oct 12, 2023 at 4:26 PM Jungtaek Lim wrote:

> Sorry, please use this link instead for SPIP doc:
> https://docs.google.com/document/d/1_iVf_CIu2RZd3yWWF6KoRNlBiz5NbSIK0yThqG0EvPY/edit?usp=sharing
>
> On Thu, Oct 12, 2023 at 3:58 PM Jungtaek Lim wrote:
>
>> Hi dev,
>>
>> I'd like to start a discussion on "State Data Source - Reader".
>>
>> This proposal aims to introduce a new data source "statestore" which
>> enables reading the state rows from an existing checkpoint via an
>> offline (batch) query. This will enable users to 1) create unit tests
>> against a stateful query that verify the state value (especially
>> flatMapGroupsWithState), 2) gather more context on the status when an
>> incident occurs, especially for incorrect output.
>>
>> *SPIP*:
>> https://docs.google.com/document/d/1HjEupRv8TRFeULtJuxRq_tEG1Wq-9UNu-ctGgCYRke0/edit?usp=sharing
>> *JIRA*: https://issues.apache.org/jira/browse/SPARK-45511
>>
>> Looking forward to your feedback!
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> ps. The scope of the project is narrowed to the reader in this SPIP,
>> since the writer requires us to consider more cases. We are planning
>> on it.
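As a rough sketch of what the proposed reader might look like from the user's side: the snippet assumes the source is registered under the short name "statestore" given in the proposal and that it takes the checkpoint location as the load path. Both the helper name `read_state_rows` and the exact options are assumptions; the SPIP doc is the authority on the real interface.

```python
def read_state_rows(spark, checkpoint_location: str):
    """Load state rows from a streaming checkpoint as a batch DataFrame.

    `spark` is an existing SparkSession supplied by the caller.
    The format name "statestore" comes from the proposal; passing the
    checkpoint location as the load path is an assumption of this sketch.
    """
    return spark.read.format("statestore").load(checkpoint_location)
```

A unit test for a stateful query could then call this helper on the test's checkpoint directory and assert on the returned rows.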