Re: [DISCUSSION] SPIP: Asynchronous Offset Management in Structured Streaming
Thanks all for the support! Great to see that we are driving the discussion for Structured Streaming and have sufficient support. We would like to move forward with the vote thread. Please also participate in the vote.

Thanks again!

On Thu, Dec 1, 2022 at 10:04 AM Wenchen Fan wrote:
> +1 to improve the widely used micro-batch mode first.
Re: [DISCUSSION] SPIP: Asynchronous Offset Management in Structured Streaming
+1 to improve the widely used micro-batch mode first.

On Thu, Dec 1, 2022 at 8:49 AM Hyukjin Kwon wrote:
> +1
Re: [DISCUSSION] SPIP: Asynchronous Offset Management in Structured Streaming
+1

On Thu, 1 Dec 2022 at 08:10, Shixiong Zhu wrote:
> +1
>
> This is exciting. I agree with Jerry that this SPIP and continuous processing are orthogonal. This SPIP itself would be a great improvement and impact most Structured Streaming users.
>
> Best Regards,
> Shixiong
Re: [DISCUSSION] SPIP: Asynchronous Offset Management in Structured Streaming
+1

This is exciting. I agree with Jerry that this SPIP and continuous processing are orthogonal. This SPIP itself would be a great improvement and impact most Structured Streaming users.

Best Regards,
Shixiong

On Wed, Nov 30, 2022 at 6:57 AM Mridul Muralidharan wrote:
> Thanks for all the clarifications and details, Jerry and Jungtaek :-)
> This looks like an exciting improvement to Structured Streaming - looking forward to it becoming part of Apache Spark!
>
> Regards,
> Mridul
Re: [DISCUSSION] SPIP: Asynchronous Offset Management in Structured Streaming
Thanks for all the clarifications and details, Jerry and Jungtaek :-)
This looks like an exciting improvement to Structured Streaming - looking forward to it becoming part of Apache Spark!

Regards,
Mridul

On Mon, Nov 28, 2022 at 8:40 PM Jerry Peng wrote:
> Hi all,
>
> I will add my two cents. Improving the micro-batch execution engine does not prevent us from working on or improving the continuous execution engine in the future. These are orthogonal issues. The new mode I am proposing for the micro-batch execution engine is intended to lower the latency of the engine that most people use today. We can view it as an incremental improvement on the existing engine. I see the continuous execution engine as a partially completed rewrite of Spark Streaming that may serve as the "future" engine powering Spark Streaming. Improving the "current" engine does not mean we cannot work on a "future" engine. The two are not mutually exclusive. I would like to focus the discussion on the merits of this feature with regard to the current micro-batch execution engine, not on the future of the continuous execution engine.
>
> Best,
>
> Jerry
Re: [DISCUSSION] SPIP: Asynchronous Offset Management in Structured Streaming
Hi all,

I will add my two cents. Improving the micro-batch execution engine does not prevent us from working on or improving the continuous execution engine in the future. These are orthogonal issues. The new mode I am proposing for the micro-batch execution engine is intended to lower the latency of the engine that most people use today. We can view it as an incremental improvement on the existing engine. I see the continuous execution engine as a partially completed rewrite of Spark Streaming that may serve as the "future" engine powering Spark Streaming. Improving the "current" engine does not mean we cannot work on a "future" engine. The two are not mutually exclusive. I would like to focus the discussion on the merits of this feature with regard to the current micro-batch execution engine, not on the future of the continuous execution engine.

Best,

Jerry

On Wed, Nov 23, 2022 at 3:17 AM Jungtaek Lim wrote:
> Hi Mridul,
>
> I'd like to make this clear to avoid any misunderstanding - the decision was not led by me. (I'm just one of the engineers in the team. Not even TL.) As you can see from the direction, there was an internal consensus not to revisit the continuous mode. There are various reasons, which I think we know already. You seem to remember that I have raised concerns about continuous mode, but did you notice that it was more than 2 years ago? I still see no traction around the project. The main reason I abandoned the discussion was the promise of an effort to integrate push-based shuffle into continuous mode to achieve shuffle, but no such effort has been made so far.
>
> The goal of this SPIP is to have an alternative approach dealing with the same workload, given that we no longer have confidence in the success of continuous mode. But I also want to make clear that deprecating and eventually retiring continuous mode is not a goal of this project. If that happens eventually, it would be a side effect. Someone may have concerns that we have two different projects aiming for a similar thing, but I'd rather see both projects compete. Anyone willing to improve continuous mode can start making the effort right now. This SPIP does not block it.
Re: [DISCUSSION] SPIP: Asynchronous Offset Management in Structured Streaming
Hi Mridul,

I'd like to make this clear to avoid any misunderstanding - the decision was not led by me. (I'm just one of the engineers in the team. Not even TL.) As you can see from the direction, there was an internal consensus not to revisit the continuous mode. There are various reasons, which I think we know already. You seem to remember that I have raised concerns about continuous mode, but did you notice that it was more than 2 years ago? I still see no traction around the project. The main reason I abandoned the discussion was the promise of an effort to integrate push-based shuffle into continuous mode to achieve shuffle, but no such effort has been made so far.

The goal of this SPIP is to have an alternative approach dealing with the same workload, given that we no longer have confidence in the success of continuous mode. But I also want to make clear that deprecating and eventually retiring continuous mode is not a goal of this project. If that happens eventually, it would be a side effect. Someone may have concerns that we have two different projects aiming for a similar thing, but I'd rather see both projects compete. Anyone willing to improve continuous mode can start making the effort right now. This SPIP does not block it.

On Wed, Nov 23, 2022 at 5:29 PM Mridul Muralidharan wrote:
> Hi Jungtaek,
>
> Given that the goal of the SPIP is reducing latency for stateless apps, which should reasonably fit the design goals of continuous mode, it feels odd not to support it in the proposal.
>
> I know you have raised concerns about continuous mode in the past as well on the dev@ list, and we are further ignoring it in this proposal (and possibly other enhancements in the past few releases).
>
> Do you want to revisit the discussion to support it and propose a vote on that? And move it to deprecated?
>
> I am much more comfortable not supporting this SPIP for CM if it was deprecated.
>
> Thoughts?
>
> Regards,
> Mridul
>
> On Wed, Nov 23, 2022 at 1:16 AM Jerry Peng wrote:
>> Jungtaek,
>>
>> Thanks for taking up the role to shepherd this SPIP! Thank you also for chiming in with your thoughts concerning the continuous mode!
>>
>> Best,
>>
>> Jerry
>>
>> On Tue, Nov 22, 2022 at 5:57 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>> Just FYI, I'm shepherding this SPIP project.
>>>
>>> I think the major meta question would be, "why don't we spend effort on continuous mode rather than initiating another feature aiming for the same workload?". Jerry already updated the doc to answer the question, but I can also share my thoughts about it.
>>>
>>> I feel like the current "continuous mode" is a niche solution. (It's not to blame. If you have to deal with such a workload but can't rewrite the underlying engine from scratch, then there are really few options.) Since the implementation went with a workaround to implement what the architecture does not support natively, e.g. distributed snapshot, it gets quite tricky to maintain and expand the project. It also requires 3rd parties to implement a separate source and sink, and I'm not sure how many 3rd parties have actually done so.
>>>
>>> Eventually, "continuous mode" has become an area that no one in the active community knows in detail or is willing to maintain. I wouldn't say we are confident enough to remove the "experimental" tag, although the feature has been shipped for years. It was introduced in Spark 2.3, surprisingly enough.
>>>
>>> We went back and thought about the approach from scratch. Jerry came up with an idea that leverages the existing micro-batch execution, hence it is relatively stable and does not require 3rd parties to support another mode. It adds complexity to micro-batch execution, but it's a lot less complicated than the existing continuous mode. Definitely far less than creating a new record-to-record engine from scratch.
>>>
>>> That said, we want to propose and move forward with the new approach.
>>>
>>> ps. Eventually we could probably discuss retiring continuous mode if the new approach gets accepted and is eventually considered stable after several minor releases. That's just me.
>>>
>>> On Wed, Nov 23, 2022 at 5:16 AM Jerry Peng wrote:
>>>> Hi all,
>>>>
>>>> I would like to start the discussion for a SPIP, Asynchronous Offset Management in Structured Streaming. The high-level summary of the SPIP is that currently in Structured Streaming we perform a couple of offset management operations for progress tracking synchronously on the critical path, which can contribute significantly to processing latency. If we were to make these operations asynchronous and less frequent, we could dramatically improve latency for certain types of workloads. I have put together a SPIP to implement such a mechanism. Please take a look!
>>>>
>>>> SPIP Jira: https://issues.
Re: [DISCUSSION] SPIP: Asynchronous Offset Management in Structured Streaming
Hi Jungtaek,

Given the goal of the SPIP is reducing latency for stateless apps, and should reasonably fit continuous mode design goals, it feels odd to not support it in the proposal.

I know you have raised concerns about continuous mode in the past as well on the dev@ list, and we are further ignoring it in this proposal (and possibly other enhancements in the past few releases).

Do you want to revisit the discussion to support it and propose a vote on that? And move it to deprecated?

I would be much more comfortable with this SPIP not supporting CM if CM were deprecated.

Thoughts?

Regards,
Mridul

On Wed, Nov 23, 2022 at 1:16 AM Jerry Peng wrote:

> Jungtaek,
>
> Thanks for taking up the role to shepherd this SPIP! Thank you also for
> chiming in with your thoughts concerning the continuous mode!
>
> Best,
>
> Jerry
Re: [DISCUSSION] SPIP: Asynchronous Offset Management in Structured Streaming
Jungtaek,

Thanks for taking up the role to shepherd this SPIP! Thank you also for chiming in with your thoughts concerning the continuous mode!

Best,

Jerry

On Tue, Nov 22, 2022 at 5:57 PM Jungtaek Lim wrote:

> Just FYI, I'm shepherding this SPIP project.
>
> I think the major meta question would be, "why don't we spend effort on
> continuous mode rather than initiating another feature aiming for the
> same workload?". Jerry already updated the doc to answer the question, but
> I can also share my thoughts about it.
>
> I feel like the current "continuous mode" is a niche solution. (That's not
> to blame anyone. If you have to deal with such a workload but can't rewrite
> the underlying engine from scratch, there are really few options.)
> Since the implementation relies on workarounds for things the architecture
> does not support natively, e.g. distributed snapshots, it is quite tricky
> to maintain and expand. It also requires 3rd parties to implement separate
> source and sink implementations, and I'm not sure how many 3rd parties
> have actually done so.
>
> Eventually, "continuous mode" became an area that no one in the active
> community knows in detail or is willing to maintain. I wouldn't say we are
> confident enough to remove the "experimental" tag, although the feature
> has been shipped for years. It was introduced in Spark 2.3, surprisingly
> enough.
>
> We went back and thought about the approach from scratch. Jerry came up
> with an idea that leverages the existing microbatch execution, so it is
> relatively stable and does not require 3rd parties to support another
> mode. It adds complexity to microbatch execution, but it's a lot less
> complicated than the existing continuous mode, and definitely far less
> than creating a new record-to-record engine from scratch.
>
> That said, we want to propose and move forward with the new approach.
>
> ps. Eventually we could probably discuss retiring continuous mode if the
> new approach gets accepted and is considered stable after several minor
> releases. That's just me.
Re: [DISCUSSION] SPIP: Asynchronous Offset Management in Structured Streaming
Just FYI, I'm shepherding this SPIP project.

I think the major meta question would be, "why don't we spend effort on continuous mode rather than initiating another feature aiming for the same workload?". Jerry already updated the doc to answer the question, but I can also share my thoughts about it.

I feel like the current "continuous mode" is a niche solution. (That's not to blame anyone. If you have to deal with such a workload but can't rewrite the underlying engine from scratch, there are really few options.) Since the implementation relies on workarounds for things the architecture does not support natively, e.g. distributed snapshots, it is quite tricky to maintain and expand. It also requires 3rd parties to implement separate source and sink implementations, and I'm not sure how many 3rd parties have actually done so.

Eventually, "continuous mode" became an area that no one in the active community knows in detail or is willing to maintain. I wouldn't say we are confident enough to remove the "experimental" tag, although the feature has been shipped for years. It was introduced in Spark 2.3, surprisingly enough.

We went back and thought about the approach from scratch. Jerry came up with an idea that leverages the existing microbatch execution, so it is relatively stable and does not require 3rd parties to support another mode. It adds complexity to microbatch execution, but it's a lot less complicated than the existing continuous mode, and definitely far less than creating a new record-to-record engine from scratch.

That said, we want to propose and move forward with the new approach.

ps. Eventually we could probably discuss retiring continuous mode if the new approach gets accepted and is considered stable after several minor releases. That's just me.
On Wed, Nov 23, 2022 at 5:16 AM Jerry Peng wrote:

> Hi all,
>
> I would like to start the discussion for a SPIP, Asynchronous Offset
> Management in Structured Streaming. The high level summary of the SPIP is
> that currently in Structured Streaming we perform a couple of offset
> management operations for progress tracking purposes synchronously on the
> critical path which can contribute significantly to processing latency. If
> we were to make these operations asynchronous and less frequent we can
> dramatically improve latency for certain types of workloads.
>
> I have put together a SPIP to implement such a mechanism. Please take a
> look!
>
> SPIP Jira: https://issues.apache.org/jira/browse/SPARK-39591
>
> SPIP doc:
> https://docs.google.com/document/d/1iPiI4YoGCM0i61pBjkxcggU57gHKf2jVwD7HWMHgH-Y/edit?usp=sharing
>
> Best,
>
> Jerry
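For readers who want a concrete picture of the summary above, here is a rough toy sketch in plain Python of the difference between persisting an offset on the critical path for every batch and handing it to a background writer less frequently. It does not use Spark or mirror its internals; `OffsetLog`, `run_sync`, `run_async` and the `every_n` parameter are all hypothetical names made up for illustration, and the actual design is whatever the SPIP doc describes.

```python
import queue
import threading


class OffsetLog:
    """Hypothetical stand-in for a durable offset log (not a Spark class)."""

    def __init__(self):
        self.persisted = []  # (batch_id, offset) pairs, in write order

    def write(self, batch_id, offset):
        self.persisted.append((batch_id, offset))


def run_sync(offsets, log):
    """Baseline: persist the offset on the critical path for every batch."""
    processed = []
    for batch_id, offset in enumerate(offsets):
        log.write(batch_id, offset)  # the batch waits for this write
        processed.append(offset)     # stand-in for processing the batch
    return processed


def run_async(offsets, log, every_n=3):
    """Queue every n-th offset (and the final one) for a background writer."""
    pending = queue.Queue()

    def writer():
        # Drain queued offsets until the None sentinel arrives.
        while True:
            item = pending.get()
            if item is None:
                break
            log.write(*item)

    thread = threading.Thread(target=writer)
    thread.start()

    processed = []
    last = len(offsets) - 1
    for batch_id, offset in enumerate(offsets):
        if batch_id % every_n == 0 or batch_id == last:
            pending.put((batch_id, offset))  # returns without waiting for the write
        processed.append(offset)

    pending.put(None)  # signal shutdown
    thread.join()      # wait so the caller sees all queued writes
    return processed


offsets = [10, 20, 30, 40, 50, 60, 70, 80]
sync_log, async_log = OffsetLog(), OffsetLog()
sync_result = run_sync(offsets, sync_log)
async_result = run_async(offsets, async_log, every_n=3)
```

In this toy model both runs process the same eight batches, but the synchronous log receives eight writes while the asynchronous one receives four, none of them on the batch loop's critical path. The price in the toy model is that the log can lag behind processing, so a restart from the last persisted offset would presumably redo some batches.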