Re: Contributing Twister2 runner to Apache Beam
Hi Ismaël, Thanks for the update, No problem at all, please take your time and let me know if my assistance is needed, The virus has affected everyone's timetables. I hope you are safe. Best Regards, Pulasthi On Fri, Apr 3, 2020 at 12:14 PM Ismaël Mejía wrote: > Hello Pulasthi, > > Please excuse me for my delay, I have probably 1/3 of my common > available time since the coronavirus lockdown so I have not advanced > as expected. I hope to catch up rapidly and ping you. Our expected > target of merging it before the 2.21.0 release seems to be hard to get > at this point because the branch will be cut next week. I hope this is > not a problem but if it is please excuse me. > > I also profit to ask any other Beamer that could have more free cycles > at the moment in case (s)he can give me an extra hand for the review. > > Regards, > Ismaël > > > On Fri, Apr 3, 2020 at 4:16 AM Pulasthi Supun Wickramasinghe > wrote: > > > > Hi Ismaël > > > > Did you get some free time to perform a code review on the pull request > > > > Best Regards > > Pulasthi > > > > On Tue, Mar 10, 2020 at 3:30 PM Luke Cwik wrote: > >> > >> I have to disagree. Allowing for runners within the Apache Beam repo > and SDKs that reach into the implementation details of each other are > usability, feature development, maintenance and complexity problems. > >> > >> The usability issue comes from our public core facing APIs exposing > methods that runners "need" so they can introspect details that shouldn't > be visible to them (e.g. setWindowingStrategyInternal on PCollection). > Getting to 1 would remove the pipeline construction time instances but not > the execution side ones and there are currently 100+ usages of the > @Internal annotation. > >> > >> The feature development and maintenance issues both stem from > duplication of work. We need to have at least two copies of how to do > something, one that is for runner -> SDK direct and one for Fn API. An > example of this is the timer family work which was started and completed > for the non portable implementation yet the portable implementation was > left as future work. > >> > >> Finally, the complexity comes from how many layers we have that wrap > existing components to create variants for different use cases. I'm looking > at all the DoFnRunners and each of their variants and how those have layers > within themselves within the SDK and how additional layers have been made > to interface with runner specific internal details. > >> > >> > >> On Tue, Mar 10, 2020 at 12:07 PM Kenneth Knowles > wrote: > >>> > >>> I do support all the efforts to get Dataflow, Flink, and Spark to 3 > (Fn API). But I disagree with it as a requirement; the whole point of > ptransforms with URNs is that if the runner can figure out how to execute > it according to semantics, then it is fine. A runner meets (1) and (2) but > can only run certain subset of DoFns is allowed by design (whether the > subset is based on language, state/timer support, etc). > >>> > >>> Kenn > >>> > >>> On Tue, Mar 10, 2020 at 9:45 AM Luke Cwik wrote: > > I would like to move away from having runners access APIs that are > related to pipeline construction and other internal SDK APIs and I would > like for SDKs to not inspect internal runner APIs. This would enable the > community to improve each independently without needing to fix the world > all the time and would enable the community to run a cluster that supports > multiple Beam versions at the same time and would also allow for the > cluster to be updated independently of the pipelines it runs. > > As a community, I believe we need to achieve 1, 2 and 3. Outside of > the Apache Beam repo, anyone can do whatever they want but there should be > no compatibility guarantees. > > 4 and 5 are extensions that enable a richer set of pipelines to run > and are optional like many other parts such as if a runner supports metrics > aggregation or dynamic work rebalancing. > > On Tue, Mar 10, 2020 at 9:11 AM Kenneth Knowles > wrote: > > > > There are a lot of different meanings to "portable runner". Here are > some: > > > > (1) A runner that accepts a pipeline proto and either runs it or > says it cannot run it > > (2) A runner that accepts jobs via the job management APIs > > (3) A runner that executes UDFs via the Fn API > > (4) A runner that can execute multiple languages > > (5) A runner that can run cross-language transforms aka multiple > languages in the same pipeline > > > > I think (1) is a very good bar, and (2) is a nice addition on top of > that. Then we have a unified way to submit pipelines and understand their > status. > > > > I think (3) is optional - a runner can run things however it likes, > including with native implementations. And then (4) and (5) as well are > just levels of feature capabilities. > > > > Kenn > > > > On Tue, Mar 10, 2020 at 8:54 AM Luke Cwik
Re: Contributing Twister2 runner to Apache Beam
Hello Pulasthi, Please excuse me for my delay, I have probably 1/3 of my common available time since the coronavirus lockdown so I have not advanced as expected. I hope to catch up rapidly and ping you. Our expected target of merging it before the 2.21.0 release seems to be hard to get at this point because the branch will be cut next week. I hope this is not a problem but if it is please excuse me. I also profit to ask any other Beamer that could have more free cycles at the moment in case (s)he can give me an extra hand for the review. Regards, Ismaël On Fri, Apr 3, 2020 at 4:16 AM Pulasthi Supun Wickramasinghe wrote: > > Hi Ismaël > > Did you get some free time to perform a code review on the pull request > > Best Regards > Pulasthi > > On Tue, Mar 10, 2020 at 3:30 PM Luke Cwik wrote: >> >> I have to disagree. Allowing for runners within the Apache Beam repo and >> SDKs that reach into the implementation details of each other are usability, >> feature development, maintenance and complexity problems. >> >> The usability issue comes from our public core facing APIs exposing methods >> that runners "need" so they can introspect details that shouldn't be visible >> to them (e.g. setWindowingStrategyInternal on PCollection). Getting to 1 >> would remove the pipeline construction time instances but not the execution >> side ones and there are currently 100+ usages of the @Internal annotation. >> >> The feature development and maintenance issues both stem from duplication of >> work. We need to have at least two copies of how to do something, one that >> is for runner -> SDK direct and one for Fn API. An example of this is the >> timer family work which was started and completed for the non portable >> implementation yet the portable implementation was left as future work. >> >> Finally, the complexity comes from how many layers we have that wrap >> existing components to create variants for different use cases. I'm looking >> at all the DoFnRunners and each of their variants and how those have layers >> within themselves within the SDK and how additional layers have been made to >> interface with runner specific internal details. >> >> >> On Tue, Mar 10, 2020 at 12:07 PM Kenneth Knowles wrote: >>> >>> I do support all the efforts to get Dataflow, Flink, and Spark to 3 (Fn >>> API). But I disagree with it as a requirement; the whole point of >>> ptransforms with URNs is that if the runner can figure out how to execute >>> it according to semantics, then it is fine. A runner meets (1) and (2) but >>> can only run certain subset of DoFns is allowed by design (whether the >>> subset is based on language, state/timer support, etc). >>> >>> Kenn >>> >>> On Tue, Mar 10, 2020 at 9:45 AM Luke Cwik wrote: I would like to move away from having runners access APIs that are related to pipeline construction and other internal SDK APIs and I would like for SDKs to not inspect internal runner APIs. This would enable the community to improve each independently without needing to fix the world all the time and would enable the community to run a cluster that supports multiple Beam versions at the same time and would also allow for the cluster to be updated independently of the pipelines it runs. As a community, I believe we need to achieve 1, 2 and 3. Outside of the Apache Beam repo, anyone can do whatever they want but there should be no compatibility guarantees. 4 and 5 are extensions that enable a richer set of pipelines to run and are optional like many other parts such as if a runner supports metrics aggregation or dynamic work rebalancing. On Tue, Mar 10, 2020 at 9:11 AM Kenneth Knowles wrote: > > There are a lot of different meanings to "portable runner". Here are some: > > (1) A runner that accepts a pipeline proto and either runs it or says it > cannot run it > (2) A runner that accepts jobs via the job management APIs > (3) A runner that executes UDFs via the Fn API > (4) A runner that can execute multiple languages > (5) A runner that can run cross-language transforms aka multiple > languages in the same pipeline > > I think (1) is a very good bar, and (2) is a nice addition on top of > that. Then we have a unified way to submit pipelines and understand their > status. > > I think (3) is optional - a runner can run things however it likes, > including with native implementations. And then (4) and (5) as well are > just levels of feature capabilities. > > Kenn > > On Tue, Mar 10, 2020 at 8:54 AM Luke Cwik wrote: >> >> +1 >> >> On Tue, Mar 10, 2020 at 12:59 AM Alex Van Boxel wrote: >>> >>> One last thing, for any runner after this one... wouldn't it be a good >>> acceptance criteria to only accept portable implementations anymore? >>> >>> _/ >>> _/ Alex
Re: Contributing Twister2 runner to Apache Beam
Hi Ismaël Did you get some free time to perform a code review on the pull request Best Regards Pulasthi On Tue, Mar 10, 2020 at 3:30 PM Luke Cwik wrote: > I have to disagree. Allowing for runners within the Apache Beam repo and > SDKs that reach into the implementation details of each other are > usability, feature development, maintenance and complexity problems. > > The usability issue comes from our public core facing APIs exposing > methods that runners "need" so they can introspect details that shouldn't > be visible to them (e.g. setWindowingStrategyInternal on PCollection). > Getting to 1 would remove the pipeline construction time instances but not > the execution side ones and there are currently 100+ usages of > the @Internal annotation. > > The feature development and maintenance issues both stem from duplication > of work. We need to have at least two copies of how to do something, one > that is for runner -> SDK direct and one for Fn API. An example of this is > the timer family work which was started and completed for the non portable > implementation yet the portable implementation was left as future work. > > Finally, the complexity comes from how many layers we have that wrap > existing components to create variants for different use cases. I'm looking > at all the DoFnRunners and each of their variants and how those have layers > within themselves within the SDK and how additional layers have been made > to interface with runner specific internal details. > > > On Tue, Mar 10, 2020 at 12:07 PM Kenneth Knowles wrote: > >> I do support all the efforts to get Dataflow, Flink, and Spark to 3 (Fn >> API). But I disagree with it as a requirement; the whole point of >> ptransforms with URNs is that if the runner can figure out how to execute >> it according to semantics, then it is fine. A runner meets (1) and (2) but >> can only run certain subset of DoFns is allowed by design (whether the >> subset is based on language, state/timer support, etc). >> >> Kenn >> >> On Tue, Mar 10, 2020 at 9:45 AM Luke Cwik wrote: >> >>> I would like to move away from having runners access APIs that are >>> related to pipeline construction and other internal SDK APIs and I would >>> like for SDKs to not inspect internal runner APIs. This would enable the >>> community to improve each independently without needing to fix the world >>> all the time and would enable the community to run a cluster that supports >>> multiple Beam versions at the same time and would also allow for the >>> cluster to be updated independently of the pipelines it runs. >>> >>> As a community, I believe we need to achieve 1, 2 and 3. Outside of the >>> Apache Beam repo, anyone can do whatever they want but there should be no >>> compatibility guarantees. >>> >>> 4 and 5 are extensions that enable a richer set of pipelines to run and >>> are optional like many other parts such as if a runner supports metrics >>> aggregation or dynamic work rebalancing. >>> >>> On Tue, Mar 10, 2020 at 9:11 AM Kenneth Knowles wrote: >>> There are a lot of different meanings to "portable runner". Here are some: (1) A runner that accepts a pipeline proto and either runs it or says it cannot run it (2) A runner that accepts jobs via the job management APIs (3) A runner that executes UDFs via the Fn API (4) A runner that can execute multiple languages (5) A runner that can run cross-language transforms aka multiple languages in the same pipeline I think (1) is a very good bar, and (2) is a nice addition on top of that. Then we have a unified way to submit pipelines and understand their status. I think (3) is optional - a runner can run things however it likes, including with native implementations. And then (4) and (5) as well are just levels of feature capabilities. Kenn On Tue, Mar 10, 2020 at 8:54 AM Luke Cwik wrote: > +1 > > On Tue, Mar 10, 2020 at 12:59 AM Alex Van Boxel > wrote: > >> One last thing, for any runner after this one... wouldn't it be a >> good acceptance criteria to only accept portable implementations anymore? >> >> _/ >> _/ Alex Van Boxel >> >> >> On Mon, Mar 9, 2020 at 10:42 PM Ismaël Mejía >> wrote: >> >>> Good points Kenn. I think we mostly agree on what has been discussed >>> in this >>> thread the pros/cons of having runners on our repository, but this >>> is probably >>> not the best moment in time to change any policy in that aspect. >>> >>> So if nobody objects I think we can proceed. I am OOO this week so >>> with less >>> time to continue with the code review, but I will be back to finish >>> the review >>> and hopefully finally get this merged with Pulasthi next week (sorry >>> for the >>> delay). >>> >>> > (don't wait for me on code review - if Ismaël said it is good, >>> then it is
Re: Contributing Twister2 runner to Apache Beam
I have to disagree. Allowing for runners within the Apache Beam repo and SDKs that reach into the implementation details of each other are usability, feature development, maintenance and complexity problems. The usability issue comes from our public core facing APIs exposing methods that runners "need" so they can introspect details that shouldn't be visible to them (e.g. setWindowingStrategyInternal on PCollection). Getting to 1 would remove the pipeline construction time instances but not the execution side ones and there are currently 100+ usages of the @Internal annotation. The feature development and maintenance issues both stem from duplication of work. We need to have at least two copies of how to do something, one that is for runner -> SDK direct and one for Fn API. An example of this is the timer family work which was started and completed for the non portable implementation yet the portable implementation was left as future work. Finally, the complexity comes from how many layers we have that wrap existing components to create variants for different use cases. I'm looking at all the DoFnRunners and each of their variants and how those have layers within themselves within the SDK and how additional layers have been made to interface with runner specific internal details. On Tue, Mar 10, 2020 at 12:07 PM Kenneth Knowles wrote: > I do support all the efforts to get Dataflow, Flink, and Spark to 3 (Fn > API). But I disagree with it as a requirement; the whole point of > ptransforms with URNs is that if the runner can figure out how to execute > it according to semantics, then it is fine. A runner meets (1) and (2) but > can only run certain subset of DoFns is allowed by design (whether the > subset is based on language, state/timer support, etc). > > Kenn > > On Tue, Mar 10, 2020 at 9:45 AM Luke Cwik wrote: > >> I would like to move away from having runners access APIs that are >> related to pipeline construction and other internal SDK APIs and I would >> like for SDKs to not inspect internal runner APIs. This would enable the >> community to improve each independently without needing to fix the world >> all the time and would enable the community to run a cluster that supports >> multiple Beam versions at the same time and would also allow for the >> cluster to be updated independently of the pipelines it runs. >> >> As a community, I believe we need to achieve 1, 2 and 3. Outside of the >> Apache Beam repo, anyone can do whatever they want but there should be no >> compatibility guarantees. >> >> 4 and 5 are extensions that enable a richer set of pipelines to run and >> are optional like many other parts such as if a runner supports metrics >> aggregation or dynamic work rebalancing. >> >> On Tue, Mar 10, 2020 at 9:11 AM Kenneth Knowles wrote: >> >>> There are a lot of different meanings to "portable runner". Here are >>> some: >>> >>> (1) A runner that accepts a pipeline proto and either runs it or says it >>> cannot run it >>> (2) A runner that accepts jobs via the job management APIs >>> (3) A runner that executes UDFs via the Fn API >>> (4) A runner that can execute multiple languages >>> (5) A runner that can run cross-language transforms aka multiple >>> languages in the same pipeline >>> >>> I think (1) is a very good bar, and (2) is a nice addition on top of >>> that. Then we have a unified way to submit pipelines and understand their >>> status. >>> >>> I think (3) is optional - a runner can run things however it likes, >>> including with native implementations. And then (4) and (5) as well are >>> just levels of feature capabilities. >>> >>> Kenn >>> >>> On Tue, Mar 10, 2020 at 8:54 AM Luke Cwik wrote: >>> +1 On Tue, Mar 10, 2020 at 12:59 AM Alex Van Boxel wrote: > One last thing, for any runner after this one... wouldn't it be a good > acceptance criteria to only accept portable implementations anymore? > > _/ > _/ Alex Van Boxel > > > On Mon, Mar 9, 2020 at 10:42 PM Ismaël Mejía > wrote: > >> Good points Kenn. I think we mostly agree on what has been discussed >> in this >> thread the pros/cons of having runners on our repository, but this is >> probably >> not the best moment in time to change any policy in that aspect. >> >> So if nobody objects I think we can proceed. I am OOO this week so >> with less >> time to continue with the code review, but I will be back to finish >> the review >> and hopefully finally get this merged with Pulasthi next week (sorry >> for the >> delay). >> >> > (don't wait for me on code review - if Ismaël said it is good, then >> it is >> > good.) >> >> Thanks for your confidence. Twister2 runners looks good so far, but I >> will >> confirm 100% next week :) In the meantime if someone has some extra >> cycles to >> take a look extra feedback is always welcome. >> >> On Mon, Mar 9,
Re: Contributing Twister2 runner to Apache Beam
I do support all the efforts to get Dataflow, Flink, and Spark to 3 (Fn API). But I disagree with it as a requirement; the whole point of ptransforms with URNs is that if the runner can figure out how to execute it according to semantics, then it is fine. A runner meets (1) and (2) but can only run certain subset of DoFns is allowed by design (whether the subset is based on language, state/timer support, etc). Kenn On Tue, Mar 10, 2020 at 9:45 AM Luke Cwik wrote: > I would like to move away from having runners access APIs that are related > to pipeline construction and other internal SDK APIs and I would like for > SDKs to not inspect internal runner APIs. This would enable the community > to improve each independently without needing to fix the world all the time > and would enable the community to run a cluster that supports multiple Beam > versions at the same time and would also allow for the cluster to be > updated independently of the pipelines it runs. > > As a community, I believe we need to achieve 1, 2 and 3. Outside of the > Apache Beam repo, anyone can do whatever they want but there should be no > compatibility guarantees. > > 4 and 5 are extensions that enable a richer set of pipelines to run and > are optional like many other parts such as if a runner supports metrics > aggregation or dynamic work rebalancing. > > On Tue, Mar 10, 2020 at 9:11 AM Kenneth Knowles wrote: > >> There are a lot of different meanings to "portable runner". Here are some: >> >> (1) A runner that accepts a pipeline proto and either runs it or says it >> cannot run it >> (2) A runner that accepts jobs via the job management APIs >> (3) A runner that executes UDFs via the Fn API >> (4) A runner that can execute multiple languages >> (5) A runner that can run cross-language transforms aka multiple >> languages in the same pipeline >> >> I think (1) is a very good bar, and (2) is a nice addition on top of >> that. Then we have a unified way to submit pipelines and understand their >> status. >> >> I think (3) is optional - a runner can run things however it likes, >> including with native implementations. And then (4) and (5) as well are >> just levels of feature capabilities. >> >> Kenn >> >> On Tue, Mar 10, 2020 at 8:54 AM Luke Cwik wrote: >> >>> +1 >>> >>> On Tue, Mar 10, 2020 at 12:59 AM Alex Van Boxel >>> wrote: >>> One last thing, for any runner after this one... wouldn't it be a good acceptance criteria to only accept portable implementations anymore? _/ _/ Alex Van Boxel On Mon, Mar 9, 2020 at 10:42 PM Ismaël Mejía wrote: > Good points Kenn. I think we mostly agree on what has been discussed > in this > thread the pros/cons of having runners on our repository, but this is > probably > not the best moment in time to change any policy in that aspect. > > So if nobody objects I think we can proceed. I am OOO this week so > with less > time to continue with the code review, but I will be back to finish > the review > and hopefully finally get this merged with Pulasthi next week (sorry > for the > delay). > > > (don't wait for me on code review - if Ismaël said it is good, then > it is > > good.) > > Thanks for your confidence. Twister2 runners looks good so far, but I > will > confirm 100% next week :) In the meantime if someone has some extra > cycles to > take a look extra feedback is always welcome. > > On Mon, Mar 9, 2020 at 5:50 AM Kenneth Knowles > wrote: > > > > I haven't heard anyone suggest that we need a vote. I haven't heard > anyone object to this being merged to master. Some time ago, we mostly > decided to favor master instead of branches, because it is so much > smoother > for contributors and users. > > > > So I am poking this thread one last time and otherwise I would > consider it consensus that once code review is done the runner is a part > of > Beam (experimental!). > > > > (don't wait for me on code review - if Ismaël said it is good, then > it is good.) > > > > Kenn > > > > On Fri, Mar 6, 2020 at 7:47 AM Pulasthi Supun Wickramasinghe < > pulasthi...@gmail.com> wrote: > >> > >> I understand that the discussion is on a more broad level than the > Twister2 runner. From my experience developing the runner the main > advantage of being inside the beam project was the easy access to the wide > range of tests and other core/utility code as Kyle pointed out. Unmerging > runners that are not properly maintained and updated would be the most > logical path to follow since the internals of the runners are only well > understood by developers of that particular project. It would be > unreasonable to expect the Beam community to maintain them. And since the > runners do not alter the core API's I assume they would be easy to unmerge
Re: Contributing Twister2 runner to Apache Beam
I would like to move away from having runners access APIs that are related to pipeline construction and other internal SDK APIs and I would like for SDKs to not inspect internal runner APIs. This would enable the community to improve each independently without needing to fix the world all the time and would enable the community to run a cluster that supports multiple Beam versions at the same time and would also allow for the cluster to be updated independently of the pipelines it runs. As a community, I believe we need to achieve 1, 2 and 3. Outside of the Apache Beam repo, anyone can do whatever they want but there should be no compatibility guarantees. 4 and 5 are extensions that enable a richer set of pipelines to run and are optional like many other parts such as if a runner supports metrics aggregation or dynamic work rebalancing. On Tue, Mar 10, 2020 at 9:11 AM Kenneth Knowles wrote: > There are a lot of different meanings to "portable runner". Here are some: > > (1) A runner that accepts a pipeline proto and either runs it or says it > cannot run it > (2) A runner that accepts jobs via the job management APIs > (3) A runner that executes UDFs via the Fn API > (4) A runner that can execute multiple languages > (5) A runner that can run cross-language transforms aka multiple languages > in the same pipeline > > I think (1) is a very good bar, and (2) is a nice addition on top of that. > Then we have a unified way to submit pipelines and understand their status. > > I think (3) is optional - a runner can run things however it likes, > including with native implementations. And then (4) and (5) as well are > just levels of feature capabilities. > > Kenn > > On Tue, Mar 10, 2020 at 8:54 AM Luke Cwik wrote: > >> +1 >> >> On Tue, Mar 10, 2020 at 12:59 AM Alex Van Boxel wrote: >> >>> One last thing, for any runner after this one... wouldn't it be a good >>> acceptance criteria to only accept portable implementations anymore? >>> >>> _/ >>> _/ Alex Van Boxel >>> >>> >>> On Mon, Mar 9, 2020 at 10:42 PM Ismaël Mejía wrote: >>> Good points Kenn. I think we mostly agree on what has been discussed in this thread the pros/cons of having runners on our repository, but this is probably not the best moment in time to change any policy in that aspect. So if nobody objects I think we can proceed. I am OOO this week so with less time to continue with the code review, but I will be back to finish the review and hopefully finally get this merged with Pulasthi next week (sorry for the delay). > (don't wait for me on code review - if Ismaël said it is good, then it is > good.) Thanks for your confidence. Twister2 runners looks good so far, but I will confirm 100% next week :) In the meantime if someone has some extra cycles to take a look extra feedback is always welcome. On Mon, Mar 9, 2020 at 5:50 AM Kenneth Knowles wrote: > > I haven't heard anyone suggest that we need a vote. I haven't heard anyone object to this being merged to master. Some time ago, we mostly decided to favor master instead of branches, because it is so much smoother for contributors and users. > > So I am poking this thread one last time and otherwise I would consider it consensus that once code review is done the runner is a part of Beam (experimental!). > > (don't wait for me on code review - if Ismaël said it is good, then it is good.) > > Kenn > > On Fri, Mar 6, 2020 at 7:47 AM Pulasthi Supun Wickramasinghe < pulasthi...@gmail.com> wrote: >> >> I understand that the discussion is on a more broad level than the Twister2 runner. From my experience developing the runner the main advantage of being inside the beam project was the easy access to the wide range of tests and other core/utility code as Kyle pointed out. Unmerging runners that are not properly maintained and updated would be the most logical path to follow since the internals of the runners are only well understood by developers of that particular project. It would be unreasonable to expect the Beam community to maintain them. And since the runners do not alter the core API's I assume they would be easy to unmerge if the need arises. >> >> Talking specifically about Twister2 runner, we hope to continue developing the runner in the future to add both streaming capability and develop a portable runner as well. The team behind Twister2 is working towards the goal to get the project into Apache Incubator in the near future (Hopefully to submit the proposal in the next couple of months). >> >> Best Regards, >> Pulasthi >> >> >> >> On Thu, Mar 5, 2020 at 6:56 PM Robert Bradshaw wrote: >>> >>> I think we will get to a point where it makes sense
Re: Contributing Twister2 runner to Apache Beam
There are a lot of different meanings to "portable runner". Here are some: (1) A runner that accepts a pipeline proto and either runs it or says it cannot run it (2) A runner that accepts jobs via the job management APIs (3) A runner that executes UDFs via the Fn API (4) A runner that can execute multiple languages (5) A runner that can run cross-language transforms aka multiple languages in the same pipeline I think (1) is a very good bar, and (2) is a nice addition on top of that. Then we have a unified way to submit pipelines and understand their status. I think (3) is optional - a runner can run things however it likes, including with native implementations. And then (4) and (5) as well are just levels of feature capabilities. Kenn On Tue, Mar 10, 2020 at 8:54 AM Luke Cwik wrote: > +1 > > On Tue, Mar 10, 2020 at 12:59 AM Alex Van Boxel wrote: > >> One last thing, for any runner after this one... wouldn't it be a good >> acceptance criteria to only accept portable implementations anymore? >> >> _/ >> _/ Alex Van Boxel >> >> >> On Mon, Mar 9, 2020 at 10:42 PM Ismaël Mejía wrote: >> >>> Good points Kenn. I think we mostly agree on what has been discussed in >>> this >>> thread the pros/cons of having runners on our repository, but this is >>> probably >>> not the best moment in time to change any policy in that aspect. >>> >>> So if nobody objects I think we can proceed. I am OOO this week so with >>> less >>> time to continue with the code review, but I will be back to finish the >>> review >>> and hopefully finally get this merged with Pulasthi next week (sorry for >>> the >>> delay). >>> >>> > (don't wait for me on code review - if Ismaël said it is good, then it >>> is >>> > good.) >>> >>> Thanks for your confidence. Twister2 runners looks good so far, but I >>> will >>> confirm 100% next week :) In the meantime if someone has some extra >>> cycles to >>> take a look extra feedback is always welcome. >>> >>> On Mon, Mar 9, 2020 at 5:50 AM Kenneth Knowles wrote: >>> > >>> > I haven't heard anyone suggest that we need a vote. I haven't heard >>> anyone object to this being merged to master. Some time ago, we mostly >>> decided to favor master instead of branches, because it is so much smoother >>> for contributors and users. >>> > >>> > So I am poking this thread one last time and otherwise I would >>> consider it consensus that once code review is done the runner is a part of >>> Beam (experimental!). >>> > >>> > (don't wait for me on code review - if Ismaël said it is good, then it >>> is good.) >>> > >>> > Kenn >>> > >>> > On Fri, Mar 6, 2020 at 7:47 AM Pulasthi Supun Wickramasinghe < >>> pulasthi...@gmail.com> wrote: >>> >> >>> >> I understand that the discussion is on a more broad level than the >>> Twister2 runner. From my experience developing the runner the main >>> advantage of being inside the beam project was the easy access to the wide >>> range of tests and other core/utility code as Kyle pointed out. Unmerging >>> runners that are not properly maintained and updated would be the most >>> logical path to follow since the internals of the runners are only well >>> understood by developers of that particular project. It would be >>> unreasonable to expect the Beam community to maintain them. And since the >>> runners do not alter the core API's I assume they would be easy to unmerge >>> if the need arises. >>> >> >>> >> Talking specifically about Twister2 runner, we hope to continue >>> developing the runner in the future to add both streaming capability and >>> develop a portable runner as well. The team behind Twister2 is working >>> towards the goal to get the project into Apache Incubator in the near >>> future (Hopefully to submit the proposal in the next couple of months). >>> >> >>> >> Best Regards, >>> >> Pulasthi >>> >> >>> >> >>> >> >>> >> On Thu, Mar 5, 2020 at 6:56 PM Robert Bradshaw >>> wrote: >>> >>> >>> >>> I think we will get to a point where it makes sense for runners to >>> >>> live in their own repositories, with their own release cadence, but >>> >>> we're not at that point yet. One prerequisite is a stable API--we're >>> >>> closing in on that with the portability protos, but many (java) >>> >>> runners actually share the common runner core libraries and that is >>> >>> even less set in stone. >>> >>> >>> >>> On the other hand, taking responsibility for maintaining all runners >>> >>> is not a tenable or scalable position for the Beam project. If a >>> >>> runner is merged, it should be understood that it can be "un-merged" >>> >>> if it causes a maintenance burden. A completely separate >>> >>> project/repository makes this less messy. >>> >>> >>> >>> On Thu, Mar 5, 2020 at 10:01 AM Kenneth Knowles >>> wrote: >>> >>> > >>> >>> > I agree with both of you, mostly :-) >>> >>> > >>> >>> > The monorepo approach doesn't work/scale well for shipped >>> libraries (name a Google library that silently just works and never causes >>> any dependency problems) and the
Re: Contributing Twister2 runner to Apache Beam
+1 On Tue, Mar 10, 2020 at 12:59 AM Alex Van Boxel wrote: > One last thing, for any runner after this one... wouldn't it be a good > acceptance criteria to only accept portable implementations anymore? > > _/ > _/ Alex Van Boxel > > > On Mon, Mar 9, 2020 at 10:42 PM Ismaël Mejía wrote: > >> Good points Kenn. I think we mostly agree on what has been discussed in >> this >> thread the pros/cons of having runners on our repository, but this is >> probably >> not the best moment in time to change any policy in that aspect. >> >> So if nobody objects I think we can proceed. I am OOO this week so with >> less >> time to continue with the code review, but I will be back to finish the >> review >> and hopefully finally get this merged with Pulasthi next week (sorry for >> the >> delay). >> >> > (don't wait for me on code review - if Ismaël said it is good, then it >> is >> > good.) >> >> Thanks for your confidence. Twister2 runners looks good so far, but I will >> confirm 100% next week :) In the meantime if someone has some extra >> cycles to >> take a look extra feedback is always welcome. >> >> On Mon, Mar 9, 2020 at 5:50 AM Kenneth Knowles wrote: >> > >> > I haven't heard anyone suggest that we need a vote. I haven't heard >> anyone object to this being merged to master. Some time ago, we mostly >> decided to favor master instead of branches, because it is so much smoother >> for contributors and users. >> > >> > So I am poking this thread one last time and otherwise I would consider >> it consensus that once code review is done the runner is a part of Beam >> (experimental!). >> > >> > (don't wait for me on code review - if Ismaël said it is good, then it >> is good.) >> > >> > Kenn >> > >> > On Fri, Mar 6, 2020 at 7:47 AM Pulasthi Supun Wickramasinghe < >> pulasthi...@gmail.com> wrote: >> >> >> >> I understand that the discussion is on a more broad level than the >> Twister2 runner. From my experience developing the runner the main >> advantage of being inside the beam project was the easy access to the wide >> range of tests and other core/utility code as Kyle pointed out. Unmerging >> runners that are not properly maintained and updated would be the most >> logical path to follow since the internals of the runners are only well >> understood by developers of that particular project. It would be >> unreasonable to expect the Beam community to maintain them. And since the >> runners do not alter the core API's I assume they would be easy to unmerge >> if the need arises. >> >> >> >> Talking specifically about Twister2 runner, we hope to continue >> developing the runner in the future to add both streaming capability and >> develop a portable runner as well. The team behind Twister2 is working >> towards the goal to get the project into Apache Incubator in the near >> future (Hopefully to submit the proposal in the next couple of months). >> >> >> >> Best Regards, >> >> Pulasthi >> >> >> >> >> >> >> >> On Thu, Mar 5, 2020 at 6:56 PM Robert Bradshaw >> wrote: >> >>> >> >>> I think we will get to a point where it makes sense for runners to >> >>> live in their own repositories, with their own release cadence, but >> >>> we're not at that point yet. One prerequisite is a stable API--we're >> >>> closing in on that with the portability protos, but many (java) >> >>> runners actually share the common runner core libraries and that is >> >>> even less set in stone. >> >>> >> >>> On the other hand, taking responsibility for maintaining all runners >> >>> is not a tenable or scalable position for the Beam project. If a >> >>> runner is merged, it should be understood that it can be "un-merged" >> >>> if it causes a maintenance burden. A completely separate >> >>> project/repository makes this less messy. >> >>> >> >>> On Thu, Mar 5, 2020 at 10:01 AM Kenneth Knowles >> wrote: >> >>> > >> >>> > I agree with both of you, mostly :-) >> >>> > >> >>> > The monorepo approach doesn't work/scale well for shipped libraries >> (name a Google library that silently just works and never causes any >> dependency problems) and the pain we feel has been constant and increasing, >> but I don't think we are at the breaking point. >> >>> > >> >>> > But Google's big monorepo [1] demonstrates similar benefits to what >> Kyle describes. In the early stages the benefit of not having to think too >> hard about build/test infra and share it everywhere is a big help, and it >> scales well. Eventually, shipping test utility libraries and compliance >> suites can be equivalent. And to your point - it is very helpful for users >> to know that they can use CassandraIO with the other Beam artifacts. This >> is why Google requires the whole big repo to depend on a single version of >> any externally-controlled artifact. But, yes, as a consequence it is >> preposterously difficult to stay up to date, since literally anything can >> block progress. You need a unified escalation chain for that policy to make >> sense. It is the definition of a
Re: Contributing Twister2 runner to Apache Beam
One last thing, for any runner after this one... wouldn't it be a good acceptance criteria to only accept portable implementations anymore? _/ _/ Alex Van Boxel On Mon, Mar 9, 2020 at 10:42 PM Ismaël Mejía wrote: > Good points Kenn. I think we mostly agree on what has been discussed in > this > thread the pros/cons of having runners on our repository, but this is > probably > not the best moment in time to change any policy in that aspect. > > So if nobody objects I think we can proceed. I am OOO this week so with > less > time to continue with the code review, but I will be back to finish the > review > and hopefully finally get this merged with Pulasthi next week (sorry for > the > delay). > > > (don't wait for me on code review - if Ismaël said it is good, then it is > > good.) > > Thanks for your confidence. Twister2 runners looks good so far, but I will > confirm 100% next week :) In the meantime if someone has some extra cycles > to > take a look extra feedback is always welcome. > > On Mon, Mar 9, 2020 at 5:50 AM Kenneth Knowles wrote: > > > > I haven't heard anyone suggest that we need a vote. I haven't heard > anyone object to this being merged to master. Some time ago, we mostly > decided to favor master instead of branches, because it is so much smoother > for contributors and users. > > > > So I am poking this thread one last time and otherwise I would consider > it consensus that once code review is done the runner is a part of Beam > (experimental!). > > > > (don't wait for me on code review - if Ismaël said it is good, then it > is good.) > > > > Kenn > > > > On Fri, Mar 6, 2020 at 7:47 AM Pulasthi Supun Wickramasinghe < > pulasthi...@gmail.com> wrote: > >> > >> I understand that the discussion is on a more broad level than the > Twister2 runner. From my experience developing the runner the main > advantage of being inside the beam project was the easy access to the wide > range of tests and other core/utility code as Kyle pointed out. Unmerging > runners that are not properly maintained and updated would be the most > logical path to follow since the internals of the runners are only well > understood by developers of that particular project. It would be > unreasonable to expect the Beam community to maintain them. And since the > runners do not alter the core API's I assume they would be easy to unmerge > if the need arises. > >> > >> Talking specifically about Twister2 runner, we hope to continue > developing the runner in the future to add both streaming capability and > develop a portable runner as well. The team behind Twister2 is working > towards the goal to get the project into Apache Incubator in the near > future (Hopefully to submit the proposal in the next couple of months). > >> > >> Best Regards, > >> Pulasthi > >> > >> > >> > >> On Thu, Mar 5, 2020 at 6:56 PM Robert Bradshaw > wrote: > >>> > >>> I think we will get to a point where it makes sense for runners to > >>> live in their own repositories, with their own release cadence, but > >>> we're not at that point yet. One prerequisite is a stable API--we're > >>> closing in on that with the portability protos, but many (java) > >>> runners actually share the common runner core libraries and that is > >>> even less set in stone. > >>> > >>> On the other hand, taking responsibility for maintaining all runners > >>> is not a tenable or scalable position for the Beam project. If a > >>> runner is merged, it should be understood that it can be "un-merged" > >>> if it causes a maintenance burden. A completely separate > >>> project/repository makes this less messy. > >>> > >>> On Thu, Mar 5, 2020 at 10:01 AM Kenneth Knowles > wrote: > >>> > > >>> > I agree with both of you, mostly :-) > >>> > > >>> > The monorepo approach doesn't work/scale well for shipped libraries > (name a Google library that silently just works and never causes any > dependency problems) and the pain we feel has been constant and increasing, > but I don't think we are at the breaking point. > >>> > > >>> > But Google's big monorepo [1] demonstrates similar benefits to what > Kyle describes. In the early stages the benefit of not having to think too > hard about build/test infra and share it everywhere is a big help, and it > scales well. Eventually, shipping test utility libraries and compliance > suites can be equivalent. And to your point - it is very helpful for users > to know that they can use CassandraIO with the other Beam artifacts. This > is why Google requires the whole big repo to depend on a single version of > any externally-controlled artifact. But, yes, as a consequence it is > preposterously difficult to stay up to date, since literally anything can > block progress. You need a unified escalation chain for that policy to make > sense. It is the definition of a healthy Apache project to *not* have that > (PMC is different). > >>> > > >>> > Independent dependencies, independent git histories, and independent > release cadence/process
Re: Contributing Twister2 runner to Apache Beam
Good points Kenn. I think we mostly agree on what has been discussed in this thread the pros/cons of having runners on our repository, but this is probably not the best moment in time to change any policy in that aspect. So if nobody objects I think we can proceed. I am OOO this week so with less time to continue with the code review, but I will be back to finish the review and hopefully finally get this merged with Pulasthi next week (sorry for the delay). > (don't wait for me on code review - if Ismaël said it is good, then it is > good.) Thanks for your confidence. Twister2 runners looks good so far, but I will confirm 100% next week :) In the meantime if someone has some extra cycles to take a look extra feedback is always welcome. On Mon, Mar 9, 2020 at 5:50 AM Kenneth Knowles wrote: > > I haven't heard anyone suggest that we need a vote. I haven't heard anyone > object to this being merged to master. Some time ago, we mostly decided to > favor master instead of branches, because it is so much smoother for > contributors and users. > > So I am poking this thread one last time and otherwise I would consider it > consensus that once code review is done the runner is a part of Beam > (experimental!). > > (don't wait for me on code review - if Ismaël said it is good, then it is > good.) > > Kenn > > On Fri, Mar 6, 2020 at 7:47 AM Pulasthi Supun Wickramasinghe > wrote: >> >> I understand that the discussion is on a more broad level than the Twister2 >> runner. From my experience developing the runner the main advantage of being >> inside the beam project was the easy access to the wide range of tests and >> other core/utility code as Kyle pointed out. Unmerging runners that are not >> properly maintained and updated would be the most logical path to follow >> since the internals of the runners are only well understood by developers of >> that particular project. It would be unreasonable to expect the Beam >> community to maintain them. And since the runners do not alter the core >> API's I assume they would be easy to unmerge if the need arises. >> >> Talking specifically about Twister2 runner, we hope to continue developing >> the runner in the future to add both streaming capability and develop a >> portable runner as well. The team behind Twister2 is working towards the >> goal to get the project into Apache Incubator in the near future (Hopefully >> to submit the proposal in the next couple of months). >> >> Best Regards, >> Pulasthi >> >> >> >> On Thu, Mar 5, 2020 at 6:56 PM Robert Bradshaw wrote: >>> >>> I think we will get to a point where it makes sense for runners to >>> live in their own repositories, with their own release cadence, but >>> we're not at that point yet. One prerequisite is a stable API--we're >>> closing in on that with the portability protos, but many (java) >>> runners actually share the common runner core libraries and that is >>> even less set in stone. >>> >>> On the other hand, taking responsibility for maintaining all runners >>> is not a tenable or scalable position for the Beam project. If a >>> runner is merged, it should be understood that it can be "un-merged" >>> if it causes a maintenance burden. A completely separate >>> project/repository makes this less messy. >>> >>> On Thu, Mar 5, 2020 at 10:01 AM Kenneth Knowles wrote: >>> > >>> > I agree with both of you, mostly :-) >>> > >>> > The monorepo approach doesn't work/scale well for shipped libraries (name >>> > a Google library that silently just works and never causes any dependency >>> > problems) and the pain we feel has been constant and increasing, but I >>> > don't think we are at the breaking point. >>> > >>> > But Google's big monorepo [1] demonstrates similar benefits to what Kyle >>> > describes. In the early stages the benefit of not having to think too >>> > hard about build/test infra and share it everywhere is a big help, and it >>> > scales well. Eventually, shipping test utility libraries and compliance >>> > suites can be equivalent. And to your point - it is very helpful for >>> > users to know that they can use CassandraIO with the other Beam >>> > artifacts. This is why Google requires the whole big repo to depend on a >>> > single version of any externally-controlled artifact. But, yes, as a >>> > consequence it is preposterously difficult to stay up to date, since >>> > literally anything can block progress. You need a unified escalation >>> > chain for that policy to make sense. It is the definition of a healthy >>> > Apache project to *not* have that (PMC is different). >>> > >>> > Independent dependencies, independent git histories, and independent >>> > release cadence/process are all separate discussions. >>> > >>> > It is a broader question than this particular contribution, so let's >>> > merge this runner before changing our whole way of doing things :-) >>> > >>> > Kenn >>> > >>> > [1] >>> >
Re: Contributing Twister2 runner to Apache Beam
I haven't heard anyone suggest that we need a vote. I haven't heard anyone object to this being merged to master. Some time ago, we mostly decided to favor master instead of branches, because it is so much smoother for contributors and users. So I am poking this thread one last time and otherwise I would consider it consensus that once code review is done the runner is a part of Beam (experimental!). (don't wait for me on code review - if Ismaël said it is good, then it is good.) Kenn On Fri, Mar 6, 2020 at 7:47 AM Pulasthi Supun Wickramasinghe < pulasthi...@gmail.com> wrote: > I understand that the discussion is on a more broad level than the > Twister2 runner. From my experience developing the runner the main > advantage of being inside the beam project was the easy access to the wide > range of tests and other core/utility code as Kyle pointed out. Unmerging > runners that are not properly maintained and updated would be the most > logical path to follow since the internals of the runners are only well > understood by developers of that particular project. It would be > unreasonable to expect the Beam community to maintain them. And since the > runners do not alter the core API's I assume they would be easy to unmerge > if the need arises. > > Talking specifically about Twister2 runner, we hope to continue developing > the runner in the future to add both streaming capability and develop a > portable runner as well. The team behind Twister2 is working towards the > goal to get the project into Apache Incubator in the near future (Hopefully > to submit the proposal in the next couple of months). > > Best Regards, > Pulasthi > > > > On Thu, Mar 5, 2020 at 6:56 PM Robert Bradshaw > wrote: > >> I think we will get to a point where it makes sense for runners to >> live in their own repositories, with their own release cadence, but >> we're not at that point yet. One prerequisite is a stable API--we're >> closing in on that with the portability protos, but many (java) >> runners actually share the common runner core libraries and that is >> even less set in stone. >> >> On the other hand, taking responsibility for maintaining all runners >> is not a tenable or scalable position for the Beam project. If a >> runner is merged, it should be understood that it can be "un-merged" >> if it causes a maintenance burden. A completely separate >> project/repository makes this less messy. >> >> On Thu, Mar 5, 2020 at 10:01 AM Kenneth Knowles wrote: >> > >> > I agree with both of you, mostly :-) >> > >> > The monorepo approach doesn't work/scale well for shipped libraries >> (name a Google library that silently just works and never causes any >> dependency problems) and the pain we feel has been constant and increasing, >> but I don't think we are at the breaking point. >> > >> > But Google's big monorepo [1] demonstrates similar benefits to what >> Kyle describes. In the early stages the benefit of not having to think too >> hard about build/test infra and share it everywhere is a big help, and it >> scales well. Eventually, shipping test utility libraries and compliance >> suites can be equivalent. And to your point - it is very helpful for users >> to know that they can use CassandraIO with the other Beam artifacts. This >> is why Google requires the whole big repo to depend on a single version of >> any externally-controlled artifact. But, yes, as a consequence it is >> preposterously difficult to stay up to date, since literally anything can >> block progress. You need a unified escalation chain for that policy to make >> sense. It is the definition of a healthy Apache project to *not* have that >> (PMC is different). >> > >> > Independent dependencies, independent git histories, and independent >> release cadence/process are all separate discussions. >> > >> > It is a broader question than this particular contribution, so let's >> merge this runner before changing our whole way of doing things :-) >> > >> > Kenn >> > >> > [1] >> https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext >> (really quite a balanced analysis) >> > >> > On Wed, Mar 4, 2020 at 11:51 AM Kyle Weaver >> wrote: >> >> >> >> > Should runners, current and future, be in the same repository as Beam >> >> > core? >> >> >> >> In the distant past, runners lived in their own repositories, and then >> were donated to Beam. But Beam's current uber-repo setup allows a lot of >> convenience. For example, a ton of code (including core functionality and >> tests) is shared directly between runners, which is useful for keeping >> runners up to date and ensuring consistent behavior between them (in other >> words, maintainable and reliable). >> >> >> >> Generally, it is up to the authors of a particular Beam related >> project/subproject to decide whether to host their code in Beam or in a >> different repo, and up to the community to decide whether to take on the >> donation, as
Re: Contributing Twister2 runner to Apache Beam
I understand that the discussion is on a more broad level than the Twister2 runner. From my experience developing the runner the main advantage of being inside the beam project was the easy access to the wide range of tests and other core/utility code as Kyle pointed out. Unmerging runners that are not properly maintained and updated would be the most logical path to follow since the internals of the runners are only well understood by developers of that particular project. It would be unreasonable to expect the Beam community to maintain them. And since the runners do not alter the core API's I assume they would be easy to unmerge if the need arises. Talking specifically about Twister2 runner, we hope to continue developing the runner in the future to add both streaming capability and develop a portable runner as well. The team behind Twister2 is working towards the goal to get the project into Apache Incubator in the near future (Hopefully to submit the proposal in the next couple of months). Best Regards, Pulasthi On Thu, Mar 5, 2020 at 6:56 PM Robert Bradshaw wrote: > I think we will get to a point where it makes sense for runners to > live in their own repositories, with their own release cadence, but > we're not at that point yet. One prerequisite is a stable API--we're > closing in on that with the portability protos, but many (java) > runners actually share the common runner core libraries and that is > even less set in stone. > > On the other hand, taking responsibility for maintaining all runners > is not a tenable or scalable position for the Beam project. If a > runner is merged, it should be understood that it can be "un-merged" > if it causes a maintenance burden. A completely separate > project/repository makes this less messy. > > On Thu, Mar 5, 2020 at 10:01 AM Kenneth Knowles wrote: > > > > I agree with both of you, mostly :-) > > > > The monorepo approach doesn't work/scale well for shipped libraries > (name a Google library that silently just works and never causes any > dependency problems) and the pain we feel has been constant and increasing, > but I don't think we are at the breaking point. > > > > But Google's big monorepo [1] demonstrates similar benefits to what Kyle > describes. In the early stages the benefit of not having to think too hard > about build/test infra and share it everywhere is a big help, and it scales > well. Eventually, shipping test utility libraries and compliance suites can > be equivalent. And to your point - it is very helpful for users to know > that they can use CassandraIO with the other Beam artifacts. This is why > Google requires the whole big repo to depend on a single version of any > externally-controlled artifact. But, yes, as a consequence it is > preposterously difficult to stay up to date, since literally anything can > block progress. You need a unified escalation chain for that policy to make > sense. It is the definition of a healthy Apache project to *not* have that > (PMC is different). > > > > Independent dependencies, independent git histories, and independent > release cadence/process are all separate discussions. > > > > It is a broader question than this particular contribution, so let's > merge this runner before changing our whole way of doing things :-) > > > > Kenn > > > > [1] > https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext > (really quite a balanced analysis) > > > > On Wed, Mar 4, 2020 at 11:51 AM Kyle Weaver wrote: > >> > >> > Should runners, current and future, be in the same repository as Beam > >> > core? > >> > >> In the distant past, runners lived in their own repositories, and then > were donated to Beam. But Beam's current uber-repo setup allows a lot of > convenience. For example, a ton of code (including core functionality and > tests) is shared directly between runners, which is useful for keeping > runners up to date and ensuring consistent behavior between them (in other > words, maintainable and reliable). > >> > >> Generally, it is up to the authors of a particular Beam related > project/subproject to decide whether to host their code in Beam or in a > different repo, and up to the community to decide whether to take on the > donation, as discussed in previous threads on the Twister2 runner. In this > case, it seems there is agreement between the Twister2 runner authors and > the community that the runner can be hosted in Beam proper. > >> > >> There are examples of successful independent Beam projects, such as > Spotify's Scio, but having an independent project with its own releases > requires a lot of dedicated resources, and the bar for entry for extending > Beam should not be that high. All that's required of subproject authors is > that they keep the subproject in step with Beam. If they can't maintain it > any longer, the subproject can be allowed to bitrot without getting in > anyone's way. On the other hand, I'm not sure
Re: Contributing Twister2 runner to Apache Beam
I think we will get to a point where it makes sense for runners to live in their own repositories, with their own release cadence, but we're not at that point yet. One prerequisite is a stable API--we're closing in on that with the portability protos, but many (java) runners actually share the common runner core libraries and that is even less set in stone. On the other hand, taking responsibility for maintaining all runners is not a tenable or scalable position for the Beam project. If a runner is merged, it should be understood that it can be "un-merged" if it causes a maintenance burden. A completely separate project/repository makes this less messy. On Thu, Mar 5, 2020 at 10:01 AM Kenneth Knowles wrote: > > I agree with both of you, mostly :-) > > The monorepo approach doesn't work/scale well for shipped libraries (name a > Google library that silently just works and never causes any dependency > problems) and the pain we feel has been constant and increasing, but I don't > think we are at the breaking point. > > But Google's big monorepo [1] demonstrates similar benefits to what Kyle > describes. In the early stages the benefit of not having to think too hard > about build/test infra and share it everywhere is a big help, and it scales > well. Eventually, shipping test utility libraries and compliance suites can > be equivalent. And to your point - it is very helpful for users to know that > they can use CassandraIO with the other Beam artifacts. This is why Google > requires the whole big repo to depend on a single version of any > externally-controlled artifact. But, yes, as a consequence it is > preposterously difficult to stay up to date, since literally anything can > block progress. You need a unified escalation chain for that policy to make > sense. It is the definition of a healthy Apache project to *not* have that > (PMC is different). > > Independent dependencies, independent git histories, and independent release > cadence/process are all separate discussions. > > It is a broader question than this particular contribution, so let's merge > this runner before changing our whole way of doing things :-) > > Kenn > > [1] > https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext > (really quite a balanced analysis) > > On Wed, Mar 4, 2020 at 11:51 AM Kyle Weaver wrote: >> >> > Should runners, current and future, be in the same repository as Beam >> > core? >> >> In the distant past, runners lived in their own repositories, and then were >> donated to Beam. But Beam's current uber-repo setup allows a lot of >> convenience. For example, a ton of code (including core functionality and >> tests) is shared directly between runners, which is useful for keeping >> runners up to date and ensuring consistent behavior between them (in other >> words, maintainable and reliable). >> >> Generally, it is up to the authors of a particular Beam related >> project/subproject to decide whether to host their code in Beam or in a >> different repo, and up to the community to decide whether to take on the >> donation, as discussed in previous threads on the Twister2 runner. In this >> case, it seems there is agreement between the Twister2 runner authors and >> the community that the runner can be hosted in Beam proper. >> >> There are examples of successful independent Beam projects, such as >> Spotify's Scio, but having an independent project with its own releases >> requires a lot of dedicated resources, and the bar for entry for extending >> Beam should not be that high. All that's required of subproject authors is >> that they keep the subproject in step with Beam. If they can't maintain it >> any longer, the subproject can be allowed to bitrot without getting in >> anyone's way. On the other hand, I'm not sure of the details with Cassandra, >> but in general, a subproject should not have "the ability to block progress" >> just because it is contained in the Beam uber-repo. >> >> tl;dr Having an uber repo generally seems to work for Beam. Exceptions are >> few enough to be handled on a case-by-case basis. >> >> On Wed, Mar 4, 2020 at 11:12 AM Elliotte Rusty Harold >> wrote: >>> >>> Generic question without commenting on Twister2 specifically: >>> >>> Should runners, current and future, be in the same repository as Beam >>> core? Can or should they be completely separate products with their >>> own release cycles? >>> >>> Generally, loose coupling leads to more maintainable, reliable >>> projects. Specifically, Cassandra is holding back some other changes >>> in Beam and I really wish it didn't have the ability to block >>> progress. The more different runners we have in core, the worse this >>> problem is likely to become. >>> >>> >>> On Wed, Mar 4, 2020 at 2:03 PM Pulasthi Supun Wickramasinghe >>> wrote: >>> > >>> > Hi >>> > >>> > I believe the pull request is pretty complete now with the help of >>> >
Re: Contributing Twister2 runner to Apache Beam
I agree with both of you, mostly :-) The monorepo approach doesn't work/scale well for shipped libraries (name a Google library that silently just works and never causes any dependency problems) and the pain we feel has been constant and increasing, but I don't think we are at the breaking point. But Google's big monorepo [1] demonstrates similar benefits to what Kyle describes. In the early stages the benefit of not having to think too hard about build/test infra and share it everywhere is a big help, and it scales well. Eventually, shipping test utility libraries and compliance suites can be equivalent. And to your point - it is very helpful for users to know that they can use CassandraIO with the other Beam artifacts. This is why Google requires the whole big repo to depend on a single version of any externally-controlled artifact. But, yes, as a consequence it is preposterously difficult to stay up to date, since literally anything can block progress. You need a unified escalation chain for that policy to make sense. It is the definition of a healthy Apache project to *not* have that (PMC is different). Independent dependencies, independent git histories, and independent release cadence/process are all separate discussions. It is a broader question than this particular contribution, so let's merge this runner before changing our whole way of doing things :-) Kenn [1] https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext (really quite a balanced analysis) On Wed, Mar 4, 2020 at 11:51 AM Kyle Weaver wrote: > > Should runners, current and future, be in the same repository as Beam > > core? > > In the distant past, runners lived in their own repositories, and then > were donated to Beam. But Beam's current uber-repo setup allows a lot of > convenience. For example, a ton of code (including core functionality and > tests) is shared directly between runners, which is useful for keeping > runners up to date and ensuring consistent behavior between them (in other > words, maintainable and reliable). > > Generally, it is up to the authors of a particular Beam related > project/subproject to decide whether to host their code in Beam or in a > different repo, and up to the community to decide whether to take on the > donation, as discussed in previous threads on the Twister2 runner. In this > case, it seems there is agreement between the Twister2 runner authors and > the community that the runner can be hosted in Beam proper. > > There are examples of successful independent Beam projects, such as > Spotify's Scio, but having an independent project with its own releases > requires a lot of dedicated resources, and the bar for entry for extending > Beam should not be that high. All that's required of subproject authors is > that they keep the subproject in step with Beam. If they can't maintain it > any longer, the subproject can be allowed to bitrot without getting in > anyone's way. On the other hand, I'm not sure of the details with > Cassandra, but in general, a subproject should not have "the ability to > block progress" just because it is contained in the Beam uber-repo. > > tl;dr Having an uber repo generally seems to work for Beam. Exceptions are > few enough to be handled on a case-by-case basis. > > On Wed, Mar 4, 2020 at 11:12 AM Elliotte Rusty Harold > wrote: > >> Generic question without commenting on Twister2 specifically: >> >> Should runners, current and future, be in the same repository as Beam >> core? Can or should they be completely separate products with their >> own release cycles? >> >> Generally, loose coupling leads to more maintainable, reliable >> projects. Specifically, Cassandra is holding back some other changes >> in Beam and I really wish it didn't have the ability to block >> progress. The more different runners we have in core, the worse this >> problem is likely to become. >> >> >> On Wed, Mar 4, 2020 at 2:03 PM Pulasthi Supun Wickramasinghe >> wrote: >> > >> > Hi >> > >> > I believe the pull request is pretty complete now with the help of >> Ismaël. Kenn, would you be able to take a look at it and suggest any >> changes if needed?. The build checks and validations tests are passing at >> the moment. I will start working on the documentation that you mentioned >> in an earlier email separately. >> > >> > Best Regards, >> > Pulasthi >> > >> > >> > >> > >> > >> > On Tue, Feb 18, 2020 at 1:45 PM Pulasthi Supun Wickramasinghe < >> pulasthi...@gmail.com> wrote: >> >> >> >> Hi All, >> >> >> >> I have created the initial pull request [1] to contribute the Twister2 >> Beam runner to the Apache Beam codebase. More information on Twister2 can >> be found here[2] and the Twister2 codebase is available here[3]. At the >> moment only batch mode is supported in the runner, but we are planning to >> add stream support and implement a portable runner for Twister2 in the near >> future. >> >> >> >> As Kenn pointed out
Re: Contributing Twister2 runner to Apache Beam
> Should runners, current and future, be in the same repository as Beam > core? In the distant past, runners lived in their own repositories, and then were donated to Beam. But Beam's current uber-repo setup allows a lot of convenience. For example, a ton of code (including core functionality and tests) is shared directly between runners, which is useful for keeping runners up to date and ensuring consistent behavior between them (in other words, maintainable and reliable). Generally, it is up to the authors of a particular Beam related project/subproject to decide whether to host their code in Beam or in a different repo, and up to the community to decide whether to take on the donation, as discussed in previous threads on the Twister2 runner. In this case, it seems there is agreement between the Twister2 runner authors and the community that the runner can be hosted in Beam proper. There are examples of successful independent Beam projects, such as Spotify's Scio, but having an independent project with its own releases requires a lot of dedicated resources, and the bar for entry for extending Beam should not be that high. All that's required of subproject authors is that they keep the subproject in step with Beam. If they can't maintain it any longer, the subproject can be allowed to bitrot without getting in anyone's way. On the other hand, I'm not sure of the details with Cassandra, but in general, a subproject should not have "the ability to block progress" just because it is contained in the Beam uber-repo. tl;dr Having an uber repo generally seems to work for Beam. Exceptions are few enough to be handled on a case-by-case basis. On Wed, Mar 4, 2020 at 11:12 AM Elliotte Rusty Harold wrote: > Generic question without commenting on Twister2 specifically: > > Should runners, current and future, be in the same repository as Beam > core? Can or should they be completely separate products with their > own release cycles? > > Generally, loose coupling leads to more maintainable, reliable > projects. Specifically, Cassandra is holding back some other changes > in Beam and I really wish it didn't have the ability to block > progress. The more different runners we have in core, the worse this > problem is likely to become. > > > On Wed, Mar 4, 2020 at 2:03 PM Pulasthi Supun Wickramasinghe > wrote: > > > > Hi > > > > I believe the pull request is pretty complete now with the help of > Ismaël. Kenn, would you be able to take a look at it and suggest any > changes if needed?. The build checks and validations tests are passing at > the moment. I will start working on the documentation that you mentioned > in an earlier email separately. > > > > Best Regards, > > Pulasthi > > > > > > > > > > > > On Tue, Feb 18, 2020 at 1:45 PM Pulasthi Supun Wickramasinghe < > pulasthi...@gmail.com> wrote: > >> > >> Hi All, > >> > >> I have created the initial pull request [1] to contribute the Twister2 > Beam runner to the Apache Beam codebase. More information on Twister2 can > be found here[2] and the Twister2 codebase is available here[3]. At the > moment only batch mode is supported in the runner, but we are planning to > add stream support and implement a portable runner for Twister2 in the near > future. > >> > >> As Kenn pointed out in an earlier email it would be great to have > inputs from the community regarding this contribution since it is a sizable > one. I am sure there are many improvements that can be done in the > contributed codebase with input from the community. > >> > >> [1] https://github.com/apache/beam/pull/10888 > >> [2] https://twister2.org/ > >> [3] https://github.com/DSC-SPIDAL/twister2 > >> > >> Best Regards, > >> Pulasthi > >> -- > >> Pulasthi S. Wickramasinghe > >> PhD Candidate | Research Assistant > >> School of Informatics and Computing | Digital Science Center > >> Indiana University, Bloomington > >> cell: 224-386-9035 <(224)%20386-9035> > > > > > > > > -- > > Pulasthi S. Wickramasinghe > > PhD Candidate | Research Assistant > > School of Informatics and Computing | Digital Science Center > > Indiana University, Bloomington > > cell: 224-386-9035 <(224)%20386-9035> > > > > -- > Elliotte Rusty Harold > elh...@ibiblio.org >
Re: Contributing Twister2 runner to Apache Beam
Generic question without commenting on Twister2 specifically: Should runners, current and future, be in the same repository as Beam core? Can or should they be completely separate products with their own release cycles? Generally, loose coupling leads to more maintainable, reliable projects. Specifically, Cassandra is holding back some other changes in Beam and I really wish it didn't have the ability to block progress. The more different runners we have in core, the worse this problem is likely to become. On Wed, Mar 4, 2020 at 2:03 PM Pulasthi Supun Wickramasinghe wrote: > > Hi > > I believe the pull request is pretty complete now with the help of Ismaël. > Kenn, would you be able to take a look at it and suggest any changes if > needed?. The build checks and validations tests are passing at the moment. I > will start working on the documentation that you mentioned in an earlier > email separately. > > Best Regards, > Pulasthi > > > > > > On Tue, Feb 18, 2020 at 1:45 PM Pulasthi Supun Wickramasinghe > wrote: >> >> Hi All, >> >> I have created the initial pull request [1] to contribute the Twister2 Beam >> runner to the Apache Beam codebase. More information on Twister2 can be >> found here[2] and the Twister2 codebase is available here[3]. At the moment >> only batch mode is supported in the runner, but we are planning to add >> stream support and implement a portable runner for Twister2 in the near >> future. >> >> As Kenn pointed out in an earlier email it would be great to have inputs >> from the community regarding this contribution since it is a sizable one. I >> am sure there are many improvements that can be done in the contributed >> codebase with input from the community. >> >> [1] https://github.com/apache/beam/pull/10888 >> [2] https://twister2.org/ >> [3] https://github.com/DSC-SPIDAL/twister2 >> >> Best Regards, >> Pulasthi >> -- >> Pulasthi S. Wickramasinghe >> PhD Candidate | Research Assistant >> School of Informatics and Computing | Digital Science Center >> Indiana University, Bloomington >> cell: 224-386-9035 > > > > -- > Pulasthi S. Wickramasinghe > PhD Candidate | Research Assistant > School of Informatics and Computing | Digital Science Center > Indiana University, Bloomington > cell: 224-386-9035 -- Elliotte Rusty Harold elh...@ibiblio.org
Re: Contributing Twister2 runner to Apache Beam
Hi I believe the pull request is pretty complete now with the help of Ismaël. Kenn, would you be able to take a look at it and suggest any changes if needed?. The build checks and validations tests are passing at the moment. I will start working on the documentation that you mentioned in an earlier email separately. Best Regards, Pulasthi On Tue, Feb 18, 2020 at 1:45 PM Pulasthi Supun Wickramasinghe < pulasthi...@gmail.com> wrote: > Hi All, > > I have created the initial pull request [1] to contribute the Twister2 > Beam runner to the Apache Beam codebase. More information on Twister2 can > be found here[2] and the Twister2 codebase is available here[3]. At the > moment only batch mode is supported in the runner, but we are planning to > add stream support and implement a portable runner for Twister2 in the near > future. > > As Kenn pointed out in an earlier email it would be great to have inputs > from the community regarding this contribution since it is a sizable one. I > am sure there are many improvements that can be done in the contributed > codebase with input from the community. > > [1] https://github.com/apache/beam/pull/10888 > [2] https://twister2.org/ > [3] https://github.com/DSC-SPIDAL/twister2 > > Best Regards, > Pulasthi > -- > Pulasthi S. Wickramasinghe > PhD Candidate | Research Assistant > School of Informatics and Computing | Digital Science Center > Indiana University, Bloomington > cell: 224-386-9035 > -- Pulasthi S. Wickramasinghe PhD Candidate | Research Assistant School of Informatics and Computing | Digital Science Center Indiana University, Bloomington cell: 224-386-9035
Contributing Twister2 runner to Apache Beam
Hi All, I have created the initial pull request [1] to contribute the Twister2 Beam runner to the Apache Beam codebase. More information on Twister2 can be found here[2] and the Twister2 codebase is available here[3]. At the moment only batch mode is supported in the runner, but we are planning to add stream support and implement a portable runner for Twister2 in the near future. As Kenn pointed out in an earlier email it would be great to have inputs from the community regarding this contribution since it is a sizable one. I am sure there are many improvements that can be done in the contributed codebase with input from the community. [1] https://github.com/apache/beam/pull/10888 [2] https://twister2.org/ [3] https://github.com/DSC-SPIDAL/twister2 Best Regards, Pulasthi -- Pulasthi S. Wickramasinghe PhD Candidate | Research Assistant School of Informatics and Computing | Digital Science Center Indiana University, Bloomington cell: 224-386-9035