Re: Contributing Twister2 runner to Apache Beam

2020-04-03 Thread Pulasthi Supun Wickramasinghe

Hi Ismaël,

Thanks for the update. No problem at all; please take your time and let me
know if my assistance is needed. The virus has affected everyone's
timetables. I hope you are safe.

Best Regards,
Pulasthi

On Fri, Apr 3, 2020 at 12:14 PM Ismaël Mejía  wrote:

> Hello Pulasthi,
>
> Please excuse me for my delay, I have probably 1/3 of my common
> available time since the coronavirus lockdown so I have not advanced
> as expected. I hope to catch up rapidly and ping you. Our expected
> target of merging it before the 2.21.0 release seems to be hard to get
> at this point because the branch will be cut next week. I hope this is
> not a problem but if it is please excuse me.
>
> I also profit to ask any other Beamer that could have more free cycles
> at the moment in case (s)he can give me an extra hand for the review.
>
> Regards,
> Ismaël
>
>
> On Fri, Apr 3, 2020 at 4:16 AM Pulasthi Supun Wickramasinghe
>  wrote:
> >
> > Hi Ismaël
> >
> > Did you get some free time to perform a code review on the pull request
> >
> > Best Regards
> > Pulasthi
> >
> > On Tue, Mar 10, 2020 at 3:30 PM Luke Cwik  wrote:
> >>
> >> I have to disagree. Allowing for runners within the Apache Beam repo
> and SDKs that reach into the implementation details of each other are
> usability, feature development, maintenance and complexity problems.
> >>
> >> The usability issue comes from our public core facing APIs exposing
> methods that runners "need" so they can introspect details that shouldn't
> be visible to them (e.g. setWindowingStrategyInternal on PCollection).
> Getting to 1 would remove the pipeline construction time instances but not
> the execution side ones and there are currently 100+ usages of the
> @Internal annotation.
> >>
> >> The feature development and maintenance issues both stem from
> duplication of work. We need to have at least two copies of how to do
> something, one that is for runner -> SDK direct and one for Fn API. An
> example of this is the timer family work which was started and completed
> for the non portable implementation yet the portable implementation was
> left as future work.
> >>
> >> Finally, the complexity comes from how many layers we have that wrap
> existing components to create variants for different use cases. I'm looking
> at all the DoFnRunners and each of their variants and how those have layers
> within themselves within the SDK and how additional layers have been made
> to interface with runner specific internal details.
> >>
> >>
> >> On Tue, Mar 10, 2020 at 12:07 PM Kenneth Knowles 
> wrote:
> >>>
> >>> I do support all the efforts to get Dataflow, Flink, and Spark to 3
> (Fn API). But I disagree with it as a requirement; the whole point of
> ptransforms with URNs is that if the runner can figure out how to execute
> it according to semantics, then it is fine. A runner meets (1) and (2) but
> can only run certain subset of DoFns is allowed by design (whether the
> subset is based on language, state/timer support, etc).
> >>>
> >>> Kenn
> >>>
> >>> On Tue, Mar 10, 2020 at 9:45 AM Luke Cwik  wrote:
> >>>>
> >>>> I would like to move away from having runners access APIs that are
> related to pipeline construction and other internal SDK APIs and I would
> like for SDKs to not inspect internal runner APIs. This would enable the
> community to improve each independently without needing to fix the world
> all the time and would enable the community to run a cluster that supports
> multiple Beam versions at the same time and would also allow for the
> cluster to be updated independently of the pipelines it runs.
> >>>>
> >>>> As a community, I believe we need to achieve 1, 2 and 3. Outside of
> the Apache Beam repo, anyone can do whatever they want but there should be
> no compatibility guarantees.
> >>>>
> >>>> 4 and 5 are extensions that enable a richer set of pipelines to run
> and are optional like many other parts such as if a runner supports metrics
> aggregation or dynamic work rebalancing.
> >>>>
> >>>> On Tue, Mar 10, 2020 at 9:11 AM Kenneth Knowles 
> wrote:
> >>>>>
> >>>>> There are a lot of different meanings to "portable runner". Here are
> some:
> >>>>>
> >>>>> (1) A runner that accepts a pipeline proto and either runs it or
> says it cannot run it
> >>>>> (2) A runner that accepts jobs via the job management APIs
> >>>>> (3) A runner that executes UDFs via the Fn API
> >>>>> (4) A runner that can execute multiple languages
> >>>>> (5) A runner that can run cross-language transforms aka multiple
> languages in the same pipeline
> >>>>>
> >>>>> I think (1) is a very good bar, and (2) is a nice addition on top of
> that. Then we have a unified way to submit pipelines and understand their
> status.
> >>>>>
> >>>>> I think (3) is optional - a runner can run things however it likes,
> including with native implementations. And then (4) and (5) as well are
> just levels of feature capabilities.
> >>>>>
> >>>>> Kenn
> >>>>>
> >>>>> On Tue, Mar 10, 2020 at 8:54 AM Luke Cwik

Re: Contributing Twister2 runner to Apache Beam

2020-04-03 Thread Ismaël Mejía

Hello Pulasthi,

Please excuse my delay. I have probably had 1/3 of my usual available
time since the coronavirus lockdown, so I have not advanced as expected.
I hope to catch up rapidly and ping you. Our expected target of merging
it before the 2.21.0 release seems hard to meet at this point because
the branch will be cut next week. I hope this is not a problem, but if
it is, please excuse me.

I would also like to take this opportunity to ask any other Beamer who
has more free cycles at the moment whether (s)he can give me an extra
hand with the review.

Regards,
Ismaël


On Fri, Apr 3, 2020 at 4:16 AM Pulasthi Supun Wickramasinghe
 wrote:
>
> Hi Ismaël
>
> Did you get some free time to perform a code review on the pull request
>
> Best Regards
> Pulasthi
>
> On Tue, Mar 10, 2020 at 3:30 PM Luke Cwik  wrote:
>>
>> I have to disagree. Allowing for runners within the Apache Beam repo and 
>> SDKs that reach into the implementation details of each other are usability, 
>> feature development, maintenance and complexity problems.
>>
>> The usability issue comes from our public core facing APIs exposing methods 
>> that runners "need" so they can introspect details that shouldn't be visible 
>> to them (e.g. setWindowingStrategyInternal on PCollection). Getting to 1 
>> would remove the pipeline construction time instances but not the execution 
>> side ones and there are currently 100+ usages of the @Internal annotation.
>>
>> The feature development and maintenance issues both stem from duplication of 
>> work. We need to have at least two copies of how to do something, one that 
>> is for runner -> SDK direct and one for Fn API. An example of this is the 
>> timer family work which was started and completed for the non portable 
>> implementation yet the portable implementation was left as future work.
>>
>> Finally, the complexity comes from how many layers we have that wrap 
>> existing components to create variants for different use cases. I'm looking 
>> at all the DoFnRunners and each of their variants and how those have layers 
>> within themselves within the SDK and how additional layers have been made to 
>> interface with runner specific internal details.
>>
>>
>> On Tue, Mar 10, 2020 at 12:07 PM Kenneth Knowles  wrote:
>>>
>>> I do support all the efforts to get Dataflow, Flink, and Spark to 3 (Fn 
>>> API). But I disagree with it as a requirement; the whole point of 
>>> ptransforms with URNs is that if the runner can figure out how to execute 
>>> it according to semantics, then it is fine. A runner meets (1) and (2) but 
>>> can only run certain subset of DoFns is allowed by design (whether the 
>>> subset is based on language, state/timer support, etc).
>>>
>>> Kenn
>>>
>>> On Tue, Mar 10, 2020 at 9:45 AM Luke Cwik  wrote:

>>>> I would like to move away from having runners access APIs that are related 
>>>> to pipeline construction and other internal SDK APIs and I would like for 
>>>> SDKs to not inspect internal runner APIs. This would enable the community 
>>>> to improve each independently without needing to fix the world all the 
>>>> time and would enable the community to run a cluster that supports 
>>>> multiple Beam versions at the same time and would also allow for the 
>>>> cluster to be updated independently of the pipelines it runs.
>>>>
>>>> As a community, I believe we need to achieve 1, 2 and 3. Outside of the 
>>>> Apache Beam repo, anyone can do whatever they want but there should be no 
>>>> compatibility guarantees.
>>>>
>>>> 4 and 5 are extensions that enable a richer set of pipelines to run and 
>>>> are optional like many other parts such as if a runner supports metrics 
>>>> aggregation or dynamic work rebalancing.
>>>>
>>>> On Tue, Mar 10, 2020 at 9:11 AM Kenneth Knowles  wrote:
>>>>>
>>>>> There are a lot of different meanings to "portable runner". Here are some:
>>>>>
>>>>> (1) A runner that accepts a pipeline proto and either runs it or says it 
>>>>> cannot run it
>>>>> (2) A runner that accepts jobs via the job management APIs
>>>>> (3) A runner that executes UDFs via the Fn API
>>>>> (4) A runner that can execute multiple languages
>>>>> (5) A runner that can run cross-language transforms aka multiple 
>>>>> languages in the same pipeline
>>>>>
>>>>> I think (1) is a very good bar, and (2) is a nice addition on top of 
>>>>> that. Then we have a unified way to submit pipelines and understand their 
>>>>> status.
>>>>>
>>>>> I think (3) is optional - a runner can run things however it likes, 
>>>>> including with native implementations. And then (4) and (5) as well are 
>>>>> just levels of feature capabilities.
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Tue, Mar 10, 2020 at 8:54 AM Luke Cwik  wrote:
>>>>>>
>>>>>> +1
>>>>>>
>>>>>> On Tue, Mar 10, 2020 at 12:59 AM Alex Van Boxel  wrote:
>>>>>>>
>>>>>>> One last thing, for any runner after this one... wouldn't it be a good 
>>>>>>> acceptance criteria to only accept portable implementations anymore?
>>>>>>>
>>>>>>>  _/
>>>>>>> _/ Alex

Re: Contributing Twister2 runner to Apache Beam

2020-04-02 Thread Pulasthi Supun Wickramasinghe

Hi Ismaël,

Did you get some free time to perform a code review on the pull request?

Best Regards
Pulasthi

On Tue, Mar 10, 2020 at 3:30 PM Luke Cwik  wrote:

> I have to disagree. Allowing for runners within the Apache Beam repo and
> SDKs that reach into the implementation details of each other are
> usability, feature development, maintenance and complexity problems.
>
> The usability issue comes from our public core facing APIs exposing
> methods that runners "need" so they can introspect details that shouldn't
> be visible to them (e.g. setWindowingStrategyInternal on PCollection).
> Getting to 1 would remove the pipeline construction time instances but not
> the execution side ones and there are currently 100+ usages of
> the @Internal annotation.
>
> The feature development and maintenance issues both stem from duplication
> of work. We need to have at least two copies of how to do something, one
> that is for runner -> SDK direct and one for Fn API. An example of this is
> the timer family work which was started and completed for the non portable
> implementation yet the portable implementation was left as future work.
>
> Finally, the complexity comes from how many layers we have that wrap
> existing components to create variants for different use cases. I'm looking
> at all the DoFnRunners and each of their variants and how those have layers
> within themselves within the SDK and how additional layers have been made
> to interface with runner specific internal details.
>
>
> On Tue, Mar 10, 2020 at 12:07 PM Kenneth Knowles  wrote:
>
>> I do support all the efforts to get Dataflow, Flink, and Spark to 3 (Fn
>> API). But I disagree with it as a requirement; the whole point of
>> ptransforms with URNs is that if the runner can figure out how to execute
>> it according to semantics, then it is fine. A runner meets (1) and (2) but
>> can only run certain subset of DoFns is allowed by design (whether the
>> subset is based on language, state/timer support, etc).
>>
>> Kenn
>>
>> On Tue, Mar 10, 2020 at 9:45 AM Luke Cwik  wrote:
>>
>>> I would like to move away from having runners access APIs that are
>>> related to pipeline construction and other internal SDK APIs and I would
>>> like for SDKs to not inspect internal runner APIs. This would enable the
>>> community to improve each independently without needing to fix the world
>>> all the time and would enable the community to run a cluster that supports
>>> multiple Beam versions at the same time and would also allow for the
>>> cluster to be updated independently of the pipelines it runs.
>>>
>>> As a community, I believe we need to achieve 1, 2 and 3. Outside of the
>>> Apache Beam repo, anyone can do whatever they want but there should be no
>>> compatibility guarantees.
>>>
>>> 4 and 5 are extensions that enable a richer set of pipelines to run and
>>> are optional like many other parts such as if a runner supports metrics
>>> aggregation or dynamic work rebalancing.
>>>
>>> On Tue, Mar 10, 2020 at 9:11 AM Kenneth Knowles  wrote:
>>>
>>>> There are a lot of different meanings to "portable runner". Here are
>>>> some:
>>>>
>>>> (1) A runner that accepts a pipeline proto and either runs it or says
>>>> it cannot run it
>>>> (2) A runner that accepts jobs via the job management APIs
>>>> (3) A runner that executes UDFs via the Fn API
>>>> (4) A runner that can execute multiple languages
>>>> (5) A runner that can run cross-language transforms aka multiple
>>>> languages in the same pipeline
>>>>
>>>> I think (1) is a very good bar, and (2) is a nice addition on top of
>>>> that. Then we have a unified way to submit pipelines and understand their
>>>> status.
>>>>
>>>> I think (3) is optional - a runner can run things however it likes,
>>>> including with native implementations. And then (4) and (5) as well are
>>>> just levels of feature capabilities.
>>>>
>>>> Kenn
>>>>
>>>> On Tue, Mar 10, 2020 at 8:54 AM Luke Cwik  wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Tue, Mar 10, 2020 at 12:59 AM Alex Van Boxel 
>>>>> wrote:
>>>>>
>>>>>> One last thing, for any runner after this one... wouldn't it be a
>>>>>> good acceptance criteria to only accept portable implementations anymore?
>>>>>>
>>>>>>  _/
>>>>>> _/ Alex Van Boxel
>>>>>>
>>>>>>
>>>>>> On Mon, Mar 9, 2020 at 10:42 PM Ismaël Mejía 
>>>>>> wrote:
>>>>>>
>>>>>>> Good points Kenn. I think we mostly agree on what has been discussed
>>>>>>> in this
>>>>>>> thread the pros/cons of having runners on our repository, but this
>>>>>>> is probably
>>>>>>> not the best moment in time to change any policy in that aspect.
>>>>>>>
>>>>>>> So if nobody objects I think we can proceed. I am OOO this week so
>>>>>>> with less
>>>>>>> time to continue with the code review, but I will be back to finish
>>>>>>> the review
>>>>>>> and hopefully finally get this merged with Pulasthi next week (sorry
>>>>>>> for the
>>>>>>> delay).
>>>>>>>
>>>>>>> > (don't wait for me on code review - if Ismaël said it is good,
>>>>>>> then it is

Re: Contributing Twister2 runner to Apache Beam

2020-03-10 Thread Luke Cwik

I have to disagree. Allowing runners within the Apache Beam repo and SDKs
to reach into each other's implementation details creates usability,
feature development, maintenance and complexity problems.

The usability issue comes from our public core facing APIs exposing methods
that runners "need" so they can introspect details that shouldn't be
visible to them (e.g. setWindowingStrategyInternal on PCollection). Getting
to (1) would remove the pipeline-construction-time instances but not the
execution-side ones, and there are currently 100+ usages of the @Internal
annotation.

The feature development and maintenance issues both stem from duplication
of work. We need to have at least two copies of how to do something: one
for the direct runner -> SDK path and one for the Fn API. An example of this
is the timer family work, which was started and completed for the
non-portable implementation while the portable implementation was left as
future work.

Finally, the complexity comes from how many layers we have that wrap
existing components to create variants for different use cases. I'm looking
at all the DoFnRunners and each of their variants, how those have layers
within themselves within the SDK, and how additional layers have been made
to interface with runner-specific internal details.


On Tue, Mar 10, 2020 at 12:07 PM Kenneth Knowles  wrote:

> I do support all the efforts to get Dataflow, Flink, and Spark to 3 (Fn
> API). But I disagree with it as a requirement; the whole point of
> ptransforms with URNs is that if the runner can figure out how to execute
> it according to semantics, then it is fine. A runner meets (1) and (2) but
> can only run certain subset of DoFns is allowed by design (whether the
> subset is based on language, state/timer support, etc).
>
> Kenn
>
> On Tue, Mar 10, 2020 at 9:45 AM Luke Cwik  wrote:
>
>> I would like to move away from having runners access APIs that are
>> related to pipeline construction and other internal SDK APIs and I would
>> like for SDKs to not inspect internal runner APIs. This would enable the
>> community to improve each independently without needing to fix the world
>> all the time and would enable the community to run a cluster that supports
>> multiple Beam versions at the same time and would also allow for the
>> cluster to be updated independently of the pipelines it runs.
>>
>> As a community, I believe we need to achieve 1, 2 and 3. Outside of the
>> Apache Beam repo, anyone can do whatever they want but there should be no
>> compatibility guarantees.
>>
>> 4 and 5 are extensions that enable a richer set of pipelines to run and
>> are optional like many other parts such as if a runner supports metrics
>> aggregation or dynamic work rebalancing.
>>
>> On Tue, Mar 10, 2020 at 9:11 AM Kenneth Knowles  wrote:
>>
>>> There are a lot of different meanings to "portable runner". Here are
>>> some:
>>>
>>> (1) A runner that accepts a pipeline proto and either runs it or says it
>>> cannot run it
>>> (2) A runner that accepts jobs via the job management APIs
>>> (3) A runner that executes UDFs via the Fn API
>>> (4) A runner that can execute multiple languages
>>> (5) A runner that can run cross-language transforms aka multiple
>>> languages in the same pipeline
>>>
>>> I think (1) is a very good bar, and (2) is a nice addition on top of
>>> that. Then we have a unified way to submit pipelines and understand their
>>> status.
>>>
>>> I think (3) is optional - a runner can run things however it likes,
>>> including with native implementations. And then (4) and (5) as well are
>>> just levels of feature capabilities.
>>>
>>> Kenn
>>>
>>> On Tue, Mar 10, 2020 at 8:54 AM Luke Cwik  wrote:
>>>
>>>> +1
>>>>
>>>> On Tue, Mar 10, 2020 at 12:59 AM Alex Van Boxel 
>>>> wrote:
>>>>
>>>>> One last thing, for any runner after this one... wouldn't it be a good
>>>>> acceptance criteria to only accept portable implementations anymore?
>>>>>
>>>>>  _/
>>>>> _/ Alex Van Boxel
>>>>>
>>>>>
>>>>> On Mon, Mar 9, 2020 at 10:42 PM Ismaël Mejía 
>>>>> wrote:
>>>>>
>>>>>> Good points Kenn. I think we mostly agree on what has been discussed
>>>>>> in this
>>>>>> thread the pros/cons of having runners on our repository, but this is
>>>>>> probably
>>>>>> not the best moment in time to change any policy in that aspect.
>>>>>>
>>>>>> So if nobody objects I think we can proceed. I am OOO this week so
>>>>>> with less
>>>>>> time to continue with the code review, but I will be back to finish
>>>>>> the review
>>>>>> and hopefully finally get this merged with Pulasthi next week (sorry
>>>>>> for the
>>>>>> delay).
>>>>>>
>>>>>> > (don't wait for me on code review - if Ismaël said it is good, then
>>>>>> it is
>>>>>> > good.)
>>>>>>
>>>>>> Thanks for your confidence. Twister2 runners looks good so far, but I
>>>>>> will
>>>>>> confirm 100% next week :) In the meantime if someone has some extra
>>>>>> cycles to
>>>>>> take a look extra feedback is always welcome.
>>>>>>
>>>>>> On Mon, Mar 9,

Re: Contributing Twister2 runner to Apache Beam

2020-03-10 Thread Kenneth Knowles

I do support all the efforts to get Dataflow, Flink, and Spark to (3) (Fn
API). But I disagree with it as a requirement; the whole point of
ptransforms with URNs is that if the runner can figure out how to execute
a transform according to its semantics, then it is fine. A runner that
meets (1) and (2) but can only run a certain subset of DoFns is allowed by
design (whether the subset is based on language, state/timer support, etc.).

Kenn
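
A minimal sketch of the "runs it or says it cannot run it" idea from the
message above, not taken from the thread. The URN strings, SUPPORTED_URNS,
unsupported_urns and submit are all hypothetical names chosen for
illustration; they are not Beam APIs or real Beam URNs.

```python
# Hypothetical set of transform URNs this illustrative runner can execute.
SUPPORTED_URNS = frozenset({
    "example:transform:read:v1",
    "example:transform:map:v1",
})


def unsupported_urns(pipeline_urns, supported=SUPPORTED_URNS):
    """Return, sorted, the URNs in the pipeline that the runner cannot execute."""
    return sorted(set(pipeline_urns) - supported)


def submit(pipeline_urns):
    """Accept the pipeline, or say explicitly which transforms block it."""
    missing = unsupported_urns(pipeline_urns)
    if missing:
        raise NotImplementedError("cannot run transforms: " + ", ".join(missing))
    return "ACCEPTED"
```

Under this reading, a runner that supports only a subset of transforms is
still well behaved as long as it rejects the rest up front rather than
failing partway through execution.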

On Tue, Mar 10, 2020 at 9:45 AM Luke Cwik  wrote:

> I would like to move away from having runners access APIs that are related
> to pipeline construction and other internal SDK APIs and I would like for
> SDKs to not inspect internal runner APIs. This would enable the community
> to improve each independently without needing to fix the world all the time
> and would enable the community to run a cluster that supports multiple Beam
> versions at the same time and would also allow for the cluster to be
> updated independently of the pipelines it runs.
>
> As a community, I believe we need to achieve 1, 2 and 3. Outside of the
> Apache Beam repo, anyone can do whatever they want but there should be no
> compatibility guarantees.
>
> 4 and 5 are extensions that enable a richer set of pipelines to run and
> are optional like many other parts such as if a runner supports metrics
> aggregation or dynamic work rebalancing.
>
> On Tue, Mar 10, 2020 at 9:11 AM Kenneth Knowles  wrote:
>
>> There are a lot of different meanings to "portable runner". Here are some:
>>
>> (1) A runner that accepts a pipeline proto and either runs it or says it
>> cannot run it
>> (2) A runner that accepts jobs via the job management APIs
>> (3) A runner that executes UDFs via the Fn API
>> (4) A runner that can execute multiple languages
>> (5) A runner that can run cross-language transforms aka multiple
>> languages in the same pipeline
>>
>> I think (1) is a very good bar, and (2) is a nice addition on top of
>> that. Then we have a unified way to submit pipelines and understand their
>> status.
>>
>> I think (3) is optional - a runner can run things however it likes,
>> including with native implementations. And then (4) and (5) as well are
>> just levels of feature capabilities.
>>
>> Kenn
>>
>> On Tue, Mar 10, 2020 at 8:54 AM Luke Cwik  wrote:
>>
>>> +1
>>>
>>> On Tue, Mar 10, 2020 at 12:59 AM Alex Van Boxel 
>>> wrote:
>>>
>>>> One last thing, for any runner after this one... wouldn't it be a good
>>>> acceptance criteria to only accept portable implementations anymore?
>>>>
>>>>  _/
>>>> _/ Alex Van Boxel
>>>>
>>>>
>>>> On Mon, Mar 9, 2020 at 10:42 PM Ismaël Mejía  wrote:
>>>>
>>>>> Good points Kenn. I think we mostly agree on what has been discussed
>>>>> in this
>>>>> thread the pros/cons of having runners on our repository, but this is
>>>>> probably
>>>>> not the best moment in time to change any policy in that aspect.
>>>>>
>>>>> So if nobody objects I think we can proceed. I am OOO this week so
>>>>> with less
>>>>> time to continue with the code review, but I will be back to finish
>>>>> the review
>>>>> and hopefully finally get this merged with Pulasthi next week (sorry
>>>>> for the
>>>>> delay).
>>>>>
>>>>> > (don't wait for me on code review - if Ismaël said it is good, then
>>>>> it is
>>>>> > good.)
>>>>>
>>>>> Thanks for your confidence. Twister2 runners looks good so far, but I
>>>>> will
>>>>> confirm 100% next week :) In the meantime if someone has some extra
>>>>> cycles to
>>>>> take a look extra feedback is always welcome.
>>>>>
>>>>> On Mon, Mar 9, 2020 at 5:50 AM Kenneth Knowles 
>>>>> wrote:
>>>>> >
>>>>> > I haven't heard anyone suggest that we need a vote. I haven't heard
>>>>> anyone object to this being merged to master. Some time ago, we mostly
>>>>> decided to favor master instead of branches, because it is so much
>>>>> smoother
>>>>> for contributors and users.
>>>>> >
>>>>> > So I am poking this thread one last time and otherwise I would
>>>>> consider it consensus that once code review is done the runner is a part
>>>>> of
>>>>> Beam (experimental!).
>>>>> >
>>>>> > (don't wait for me on code review - if Ismaël said it is good, then
>>>>> it is good.)
>>>>> >
>>>>> > Kenn
>>>>> >
>>>>> > On Fri, Mar 6, 2020 at 7:47 AM Pulasthi Supun Wickramasinghe <
>>>>> pulasthi...@gmail.com> wrote:
>>>>> >>
>>>>> >> I understand that the discussion is on a more broad level than the
>>>>> Twister2 runner. From my experience developing the runner the main
>>>>> advantage of being inside the beam project was the easy access to the wide
>>>>> range of tests and other core/utility code as Kyle pointed out. Unmerging
>>>>> runners that are not properly maintained and updated would be the most
>>>>> logical path to follow since the internals of the runners are only well
>>>>> understood by developers of that particular project. It would be
>>>>> unreasonable to expect the Beam community to maintain them. And since the
>>>>> runners do not alter the core API's I assume they would be easy to unmerge
Re: Contributing Twister2 runner to Apache Beam

2020-03-10 Thread Luke Cwik

I would like to move away from having runners access APIs that are related
to pipeline construction and other internal SDK APIs, and I would like SDKs
not to inspect internal runner APIs. This would let the community improve
each independently without needing to fix the world all the time. It would
also let the community run a cluster that supports multiple Beam versions
at the same time, and allow the cluster to be updated independently of the
pipelines it runs.

As a community, I believe we need to achieve (1), (2) and (3). Outside of
the Apache Beam repo, anyone can do whatever they want, but there should be
no compatibility guarantees.

(4) and (5) are extensions that enable a richer set of pipelines to run and
are optional, like many other capabilities such as whether a runner supports
metrics aggregation or dynamic work rebalancing.

On Tue, Mar 10, 2020 at 9:11 AM Kenneth Knowles  wrote:

> There are a lot of different meanings to "portable runner". Here are some:
>
> (1) A runner that accepts a pipeline proto and either runs it or says it
> cannot run it
> (2) A runner that accepts jobs via the job management APIs
> (3) A runner that executes UDFs via the Fn API
> (4) A runner that can execute multiple languages
> (5) A runner that can run cross-language transforms aka multiple languages
> in the same pipeline
>
> I think (1) is a very good bar, and (2) is a nice addition on top of that.
> Then we have a unified way to submit pipelines and understand their status.
>
> I think (3) is optional - a runner can run things however it likes,
> including with native implementations. And then (4) and (5) as well are
> just levels of feature capabilities.
>
> Kenn
>
> On Tue, Mar 10, 2020 at 8:54 AM Luke Cwik  wrote:
>
>> +1
>>
>> On Tue, Mar 10, 2020 at 12:59 AM Alex Van Boxel  wrote:
>>
>>> One last thing, for any runner after this one... wouldn't it be a good
>>> acceptance criteria to only accept portable implementations anymore?
>>>
>>>  _/
>>> _/ Alex Van Boxel
>>>
>>>
>>> On Mon, Mar 9, 2020 at 10:42 PM Ismaël Mejía  wrote:
>>>
>>>> Good points Kenn. I think we mostly agree on what has been discussed in
>>>> this
>>>> thread the pros/cons of having runners on our repository, but this is
>>>> probably
>>>> not the best moment in time to change any policy in that aspect.
>>>>
>>>> So if nobody objects I think we can proceed. I am OOO this week so with
>>>> less
>>>> time to continue with the code review, but I will be back to finish the
>>>> review
>>>> and hopefully finally get this merged with Pulasthi next week (sorry
>>>> for the
>>>> delay).
>>>>
>>>> > (don't wait for me on code review - if Ismaël said it is good, then
>>>> it is
>>>> > good.)
>>>>
>>>> Thanks for your confidence. Twister2 runners looks good so far, but I
>>>> will
>>>> confirm 100% next week :) In the meantime if someone has some extra
>>>> cycles to
>>>> take a look extra feedback is always welcome.
>>>>
>>>> On Mon, Mar 9, 2020 at 5:50 AM Kenneth Knowles  wrote:
>>>> >
>>>> > I haven't heard anyone suggest that we need a vote. I haven't heard
>>>> anyone object to this being merged to master. Some time ago, we mostly
>>>> decided to favor master instead of branches, because it is so much smoother
>>>> for contributors and users.
>>>> >
>>>> > So I am poking this thread one last time and otherwise I would
>>>> consider it consensus that once code review is done the runner is a part of
>>>> Beam (experimental!).
>>>> >
>>>> > (don't wait for me on code review - if Ismaël said it is good, then
>>>> it is good.)
>>>> >
>>>> > Kenn
>>>> >
>>>> > On Fri, Mar 6, 2020 at 7:47 AM Pulasthi Supun Wickramasinghe <
>>>> pulasthi...@gmail.com> wrote:
>>>> >>
>>>> >> I understand that the discussion is on a more broad level than the
>>>> Twister2 runner. From my experience developing the runner the main
>>>> advantage of being inside the beam project was the easy access to the wide
>>>> range of tests and other core/utility code as Kyle pointed out. Unmerging
>>>> runners that are not properly maintained and updated would be the most
>>>> logical path to follow since the internals of the runners are only well
>>>> understood by developers of that particular project. It would be
>>>> unreasonable to expect the Beam community to maintain them. And since the
>>>> runners do not alter the core API's I assume they would be easy to unmerge
>>>> if the need arises.
>>>> >>
>>>> >> Talking specifically about Twister2 runner, we hope to continue
>>>> developing the runner in the future to add both streaming capability and
>>>> develop a portable runner as well. The team behind Twister2 is working
>>>> towards the goal to get the project into Apache Incubator in the near
>>>> future (Hopefully to submit the proposal in the next couple of months).
>>>> >>
>>>> >> Best Regards,
>>>> >> Pulasthi
>>>> >>
>>>> >>
>>>> >>
>>>> >> On Thu, Mar 5, 2020 at 6:56 PM Robert Bradshaw 
>>>> wrote:
>>>> >>>
>>>> >>> I think we will get to a point where it makes sense
Re: Contributing Twister2 runner to Apache Beam

2020-03-10 Thread Kenneth Knowles

There are a lot of different meanings to "portable runner". Here are some:

(1) A runner that accepts a pipeline proto and either runs it or says it
cannot run it
(2) A runner that accepts jobs via the job management APIs
(3) A runner that executes UDFs via the Fn API
(4) A runner that can execute multiple languages
(5) A runner that can run cross-language transforms aka multiple languages
in the same pipeline

I think (1) is a very good bar, and (2) is a nice addition on top of that.
Then we have a unified way to submit pipelines and understand their status.

I think (3) is optional - a runner can run things however it likes,
including with native implementations. And then (4) and (5) as well are
just levels of feature capabilities.

Kenn
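
The five meanings listed above can be sketched as a small capability model.
This is an illustration only, not something proposed in the thread: the
Portability enum, REQUIRED, meets_bar and native_runner are hypothetical
names, and the default bar encodes just the suggestion that (1) is the
minimum.

```python
from enum import Enum


class Portability(Enum):
    """The five meanings of "portable runner", numbered as in the message."""
    ACCEPTS_PIPELINE_PROTO = 1
    JOB_MANAGEMENT_API = 2
    FN_API_EXECUTION = 3
    MULTI_LANGUAGE = 4
    CROSS_LANGUAGE = 5


# Assumed default bar: only (1) is required; everything else is optional.
REQUIRED = frozenset({Portability.ACCEPTS_PIPELINE_PROTO})


def meets_bar(capabilities, required=REQUIRED):
    """Return True if the runner's capabilities cover every required level."""
    return required <= set(capabilities)


# Hypothetical runner that implements (1) and (2) but executes UDFs natively.
native_runner = {
    Portability.ACCEPTS_PIPELINE_PROTO,
    Portability.JOB_MANAGEMENT_API,
}
```

Passing a stricter `required` set, for example one that also contains
Portability.FN_API_EXECUTION, models the alternative position that (3)
should be mandatory for runners in the repository.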

On Tue, Mar 10, 2020 at 8:54 AM Luke Cwik  wrote:

> +1
>
> On Tue, Mar 10, 2020 at 12:59 AM Alex Van Boxel  wrote:
>
>> One last thing, for any runner after this one... wouldn't it be a good
>> acceptance criteria to only accept portable implementations anymore?
>>
>>  _/
>> _/ Alex Van Boxel
>>

Re: Contributing Twister2 runner to Apache Beam

2020-03-10 Thread Luke Cwik

+1

On Tue, Mar 10, 2020 at 12:59 AM Alex Van Boxel  wrote:

> One last thing, for any runner after this one... wouldn't it be a good
> acceptance criterion to accept only portable implementations from now on?
>
>  _/
> _/ Alex Van Boxel

Re: Contributing Twister2 runner to Apache Beam

2020-03-10 Thread Alex Van Boxel

One last thing, for any runner after this one... wouldn't it be a good
acceptance criterion to accept only portable implementations from now on?

 _/
_/ Alex Van Boxel


On Mon, Mar 9, 2020 at 10:42 PM Ismaël Mejía  wrote:

> Good points Kenn. I think we mostly agree on what has been discussed in this
> thread (the pros and cons of having runners in our repository), but this is
> probably not the best moment to change any policy in that respect.
>
> So if nobody objects I think we can proceed. I am OOO this week, so I have less
> time to continue with the code review, but I will be back to finish the review
> and hopefully finally get this merged with Pulasthi next week (sorry for the
> delay).
>
> > (don't wait for me on code review - if Ismaël said it is good, then it is
> > good.)
>
> Thanks for your confidence. The Twister2 runner looks good so far, but I will
> confirm 100% next week :) In the meantime, if someone has some extra cycles to
> take a look, extra feedback is always welcome.

Re: Contributing Twister2 runner to Apache Beam

2020-03-09 Thread Ismaël Mejía

Good points Kenn. I think we mostly agree on what has been discussed in this
thread (the pros and cons of having runners in our repository), but this is
probably not the best moment to change any policy in that respect.

So if nobody objects I think we can proceed. I am OOO this week, so I have less
time to continue with the code review, but I will be back to finish the review
and hopefully finally get this merged with Pulasthi next week (sorry for the
delay).

> (don't wait for me on code review - if Ismaël said it is good, then it is
> good.)

Thanks for your confidence. The Twister2 runner looks good so far, but I will
confirm 100% next week :) In the meantime, if someone has some extra cycles to
take a look, extra feedback is always welcome.

On Mon, Mar 9, 2020 at 5:50 AM Kenneth Knowles  wrote:
>
> I haven't heard anyone suggest that we need a vote. I haven't heard anyone 
> object to this being merged to master. Some time ago, we mostly decided to 
> favor master instead of branches, because it is so much smoother for 
> contributors and users.
>
> So I am poking this thread one last time and otherwise I would consider it 
> consensus that once code review is done the runner is a part of Beam 
> (experimental!).
>
> (don't wait for me on code review - if Ismaël said it is good, then it is 
> good.)
>
> Kenn

Re: Contributing Twister2 runner to Apache Beam

2020-03-08 Thread Kenneth Knowles

I haven't heard anyone suggest that we need a vote. I haven't heard anyone
object to this being merged to master. Some time ago, we mostly decided to
favor master instead of branches, because it is so much smoother for
contributors and users.

So I am poking this thread one last time and otherwise I would consider it
consensus that once code review is done the runner is a part of Beam
(experimental!).

(don't wait for me on code review - if Ismaël said it is good, then it is
good.)

Kenn

On Fri, Mar 6, 2020 at 7:47 AM Pulasthi Supun Wickramasinghe <
pulasthi...@gmail.com> wrote:

> I understand that the discussion is on a broader level than the Twister2
> runner. From my experience developing the runner, the main advantage of
> being inside the Beam project was the easy access to the wide range of
> tests and other core/utility code, as Kyle pointed out. Unmerging runners
> that are not properly maintained and updated would be the most logical path
> to follow, since the internals of the runners are only well understood by
> developers of that particular project. It would be unreasonable to expect
> the Beam community to maintain them. And since the runners do not alter the
> core APIs, I assume they would be easy to unmerge if the need arises.
>
> Talking specifically about the Twister2 runner, we hope to continue developing
> it to add streaming capability and a portable runner as well. The team behind
> Twister2 is working towards getting the project into the Apache Incubator in
> the near future (hopefully submitting the proposal in the next couple of months).
>
> Best Regards,
> Pulasthi

Re: Contributing Twister2 runner to Apache Beam

2020-03-06 Thread Pulasthi Supun Wickramasinghe

I understand that the discussion is on a broader level than the Twister2
runner. From my experience developing the runner, the main advantage of
being inside the Beam project was the easy access to the wide range of
tests and other core/utility code, as Kyle pointed out. Unmerging runners
that are not properly maintained and updated would be the most logical path
to follow, since the internals of the runners are only well understood by
developers of that particular project. It would be unreasonable to expect
the Beam community to maintain them. And since the runners do not alter the
core APIs, I assume they would be easy to unmerge if the need arises.

Talking specifically about the Twister2 runner, we hope to continue developing
it to add streaming capability and a portable runner as well. The team behind
Twister2 is working towards getting the project into the Apache Incubator in
the near future (hopefully submitting the proposal in the next couple of months).

Best Regards,
Pulasthi



On Thu, Mar 5, 2020 at 6:56 PM Robert Bradshaw  wrote:

> I think we will get to a point where it makes sense for runners to
> live in their own repositories, with their own release cadence, but
> we're not at that point yet. One prerequisite is a stable API--we're
> closing in on that with the portability protos, but many (java)
> runners actually share the common runner core libraries and that is
> even less set in stone.
>
> On the other hand, taking responsibility for maintaining all runners
> is not a tenable or scalable position for the Beam project. If a
> runner is merged, it should be understood that it can be "un-merged"
> if it causes a maintenance burden. A completely separate
> project/repository makes this less messy.

Re: Contributing Twister2 runner to Apache Beam

2020-03-05 Thread Robert Bradshaw

I think we will get to a point where it makes sense for runners to
live in their own repositories, with their own release cadence, but
we're not at that point yet. One prerequisite is a stable API--we're
closing in on that with the portability protos, but many (java)
runners actually share the common runner core libraries and that is
even less set in stone.

On the other hand, taking responsibility for maintaining all runners
is not a tenable or scalable position for the Beam project. If a
runner is merged, it should be understood that it can be "un-merged"
if it causes a maintenance burden. A completely separate
project/repository makes this less messy.

On Thu, Mar 5, 2020 at 10:01 AM Kenneth Knowles  wrote:
>
> I agree with both of you, mostly :-)
>
> The monorepo approach doesn't work/scale well for shipped libraries (name a 
> Google library that silently just works and never causes any dependency 
> problems) and the pain we feel has been constant and increasing, but I don't 
> think we are at the breaking point.
>
> But Google's big monorepo [1] demonstrates similar benefits to what Kyle 
> describes. In the early stages, not having to think too hard about build/test
> infra, and being able to share it everywhere, is a big help, and it scales
> well. Eventually, shipping test utility libraries and compliance suites can 
> be equivalent. And to your point - it is very helpful for users to know that 
> they can use CassandraIO with the other Beam artifacts. This is why Google 
> requires the whole big repo to depend on a single version of any 
> externally-controlled artifact. But, yes, as a consequence it is 
> preposterously difficult to stay up to date, since literally anything can 
> block progress. You need a unified escalation chain for that policy to make 
> sense. It is the definition of a healthy Apache project to *not* have that 
> (PMC is different).
>
> Independent dependencies, independent git histories, and independent release 
> cadence/process are all separate discussions.
>
> It is a broader question than this particular contribution, so let's merge 
> this runner before changing our whole way of doing things :-)
>
> Kenn
>
> [1] 
> https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext
>  (really quite a balanced analysis)
>
> On Wed, Mar 4, 2020 at 11:51 AM Kyle Weaver  wrote:
>>
>> > Should runners, current and future, be in the same repository as Beam
>> > core?
>>
>> In the distant past, runners lived in their own repositories, and then were 
>> donated to Beam. But Beam's current uber-repo setup allows a lot of 
>> convenience. For example, a ton of code (including core functionality and 
>> tests) is shared directly between runners, which is useful for keeping 
>> runners up to date and ensuring consistent behavior between them (in other 
>> words, maintainable and reliable).
>>
>> Generally, it is up to the authors of a particular Beam related 
>> project/subproject to decide whether to host their code in Beam or in a 
>> different repo, and up to the community to decide whether to take on the 
>> donation, as discussed in previous threads on the Twister2 runner. In this 
>> case, it seems there is agreement between the Twister2 runner authors and 
>> the community that the runner can be hosted in Beam proper.
>>
>> There are examples of successful independent Beam projects, such as 
>> Spotify's Scio, but having an independent project with its own releases 
>> requires a lot of dedicated resources, and the bar for entry for extending 
>> Beam should not be that high. All that's required of subproject authors is 
>> that they keep the subproject in step with Beam. If they can't maintain it 
>> any longer, the subproject can be allowed to bitrot without getting in 
>> anyone's way. On the other hand, I'm not sure of the details with Cassandra, 
>> but in general, a subproject should not have "the ability to block progress" 
>> just because it is contained in the Beam uber-repo.
>>
>> tl;dr Having an uber repo generally seems to work for Beam. Exceptions are 
>> few enough to be handled on a case-by-case basis.
>>
>> On Wed, Mar 4, 2020 at 11:12 AM Elliotte Rusty Harold  
>> wrote:
>>>
>>> Generic question without commenting on Twister2 specifically:
>>>
>>> Should runners, current and future, be in the same repository as Beam
>>> core? Can or should they be completely separate products with their
>>> own release cycles?
>>>
>>> Generally, loose coupling leads to more maintainable, reliable
>>> projects. Specifically, Cassandra is holding back some other changes
>>> in Beam and I really wish it didn't have the ability to block
>>> progress. The more different runners we have in core, the worse this
>>> problem is likely to become.
>>>
>>>
>>> On Wed, Mar 4, 2020 at 2:03 PM Pulasthi Supun Wickramasinghe
>>>  wrote:
>>> >
>>> > Hi
>>> >
>>> > I believe the pull request is pretty complete now with the help of 
>>> >

Re: Contributing Twister2 runner to Apache Beam

2020-03-05 Thread Kenneth Knowles

I agree with both of you, mostly :-)

The monorepo approach doesn't work/scale well for shipped libraries (name a
Google library that silently just works and never causes any dependency
problems) and the pain we feel has been constant and increasing, but I
don't think we are at the breaking point.

But Google's big monorepo [1] demonstrates similar benefits to what Kyle
describes. In the early stages, not having to think too hard about build/test
infra, and being able to share it everywhere, is a big help, and it scales
well. Eventually, shipping test utility libraries and compliance suites can
be equivalent. And to your point - it is very helpful for users to know
that they can use CassandraIO with the other Beam artifacts. This is why
Google requires the whole big repo to depend on a single version of any
externally-controlled artifact. But, yes, as a consequence it is
preposterously difficult to stay up to date, since literally anything can
block progress. You need a unified escalation chain for that policy to make
sense. It is the definition of a healthy Apache project to *not* have that
(PMC is different).

Independent dependencies, independent git histories, and independent
release cadence/process are all separate discussions.

It is a broader question than this particular contribution, so let's merge
this runner before changing our whole way of doing things :-)

Kenn

[1]
https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext
(really quite a balanced analysis)

On Wed, Mar 4, 2020 at 11:51 AM Kyle Weaver  wrote:

> > Should runners, current and future, be in the same repository as Beam
> > core?
>
> In the distant past, runners lived in their own repositories, and then
> were donated to Beam. But Beam's current uber-repo setup allows a lot of
> convenience. For example, a ton of code (including core functionality and
> tests) is shared directly between runners, which is useful for keeping
> runners up to date and ensuring consistent behavior between them (in other
> words, maintainable and reliable).
>
> Generally, it is up to the authors of a particular Beam related
> project/subproject to decide whether to host their code in Beam or in a
> different repo, and up to the community to decide whether to take on the
> donation, as discussed in previous threads on the Twister2 runner. In this
> case, it seems there is agreement between the Twister2 runner authors and
> the community that the runner can be hosted in Beam proper.
>
> There are examples of successful independent Beam projects, such as
> Spotify's Scio, but having an independent project with its own releases
> requires a lot of dedicated resources, and the bar for entry for extending
> Beam should not be that high. All that's required of subproject authors is
> that they keep the subproject in step with Beam. If they can't maintain it
> any longer, the subproject can be allowed to bitrot without getting in
> anyone's way. On the other hand, I'm not sure of the details with
> Cassandra, but in general, a subproject should not have "the ability to
> block progress" just because it is contained in the Beam uber-repo.
>
> tl;dr Having an uber repo generally seems to work for Beam. Exceptions are
> few enough to be handled on a case-by-case basis.
>
> On Wed, Mar 4, 2020 at 11:12 AM Elliotte Rusty Harold 
> wrote:
>
>> Generic question without commenting on Twister2 specifically:
>>
>> Should runners, current and future, be in the same repository as Beam
>> core? Can or should they be completely separate products with their
>> own release cycles?
>>
>> Generally, loose coupling leads to more maintainable, reliable
>> projects. Specifically, Cassandra is holding back some other changes
>> in Beam and I really wish it didn't have the ability to block
>> progress. The more different runners we have in core, the worse this
>> problem is likely to become.
>>
>>
>> On Wed, Mar 4, 2020 at 2:03 PM Pulasthi Supun Wickramasinghe
>>  wrote:
>> >
>> > Hi
>> >
>> > I believe the pull request is pretty complete now with the help of
>> Ismaël. Kenn, would you be able to take a look at it and suggest any
>> changes if needed?. The build checks and validations tests are passing at
>> the moment.  I will start working on the documentation that you mentioned
>> in an earlier email separately.
>> >
>> > Best Regards,
>> > Pulasthi
>> >
>> >
>> >
>> >
>> >
>> > On Tue, Feb 18, 2020 at 1:45 PM Pulasthi Supun Wickramasinghe <
>> pulasthi...@gmail.com> wrote:
>> >>
>> >> Hi All,
>> >>
>> >> I have created the initial pull request [1] to contribute the Twister2
>> Beam runner to the Apache Beam codebase. More information on Twister2 can
>> be found here[2] and the Twister2 codebase is available here[3]. At the
>> moment only batch mode is supported in the runner, but we are planning to
>> add stream support and implement a portable runner for Twister2 in the near
>> future.
>> >>
>> >> As Kenn pointed out

Re: Contributing Twister2 runner to Apache Beam

2020-03-04 Thread Kyle Weaver

> Should runners, current and future, be in the same repository as Beam
> core?

In the distant past, runners lived in their own repositories, and then were
donated to Beam. But Beam's current uber-repo setup allows a lot of
convenience. For example, a ton of code (including core functionality and
tests) is shared directly between runners, which is useful for keeping
runners up to date and ensuring consistent behavior between them (in other
words, maintainable and reliable).

Generally, it is up to the authors of a particular Beam related
project/subproject to decide whether to host their code in Beam or in a
different repo, and up to the community to decide whether to take on the
donation, as discussed in previous threads on the Twister2 runner. In this
case, it seems there is agreement between the Twister2 runner authors and
the community that the runner can be hosted in Beam proper.

There are examples of successful independent Beam projects, such as
Spotify's Scio, but having an independent project with its own releases
requires a lot of dedicated resources, and the bar for entry for extending
Beam should not be that high. All that's required of subproject authors is
that they keep the subproject in step with Beam. If they can't maintain it
any longer, the subproject can be allowed to bitrot without getting in
anyone's way. On the other hand, I'm not sure of the details with
Cassandra, but in general, a subproject should not have "the ability to
block progress" just because it is contained in the Beam uber-repo.

tl;dr Having an uber repo generally seems to work for Beam. Exceptions are
few enough to be handled on a case-by-case basis.

On Wed, Mar 4, 2020 at 11:12 AM Elliotte Rusty Harold 
wrote:

> Generic question without commenting on Twister2 specifically:
>
> Should runners, current and future, be in the same repository as Beam
> core? Can or should they be completely separate products with their
> own release cycles?
>
> Generally, loose coupling leads to more maintainable, reliable
> projects. Specifically, Cassandra is holding back some other changes
> in Beam and I really wish it didn't have the ability to block
> progress. The more different runners we have in core, the worse this
> problem is likely to become.
>
>
> On Wed, Mar 4, 2020 at 2:03 PM Pulasthi Supun Wickramasinghe
>  wrote:
> >
> > Hi
> >
> > I believe the pull request is pretty complete now with the help of
> Ismaël. Kenn, would you be able to take a look at it and suggest any
> changes if needed?. The build checks and validations tests are passing at
> the moment.  I will start working on the documentation that you mentioned
> in an earlier email separately.
> >
> > Best Regards,
> > Pulasthi
> >
> >
> >
> >
> >
> > On Tue, Feb 18, 2020 at 1:45 PM Pulasthi Supun Wickramasinghe <
> pulasthi...@gmail.com> wrote:
> >>
> >> Hi All,
> >>
> >> I have created the initial pull request [1] to contribute the Twister2
> Beam runner to the Apache Beam codebase. More information on Twister2 can
> be found here[2] and the Twister2 codebase is available here[3]. At the
> moment only batch mode is supported in the runner, but we are planning to
> add stream support and implement a portable runner for Twister2 in the near
> future.
> >>
> >> As Kenn pointed out in an earlier email it would be great to have
> inputs from the community regarding this contribution since it is a sizable
> one. I am sure there are many improvements that can be done in the
> contributed codebase with input from the community.
> >>
> >> [1] https://github.com/apache/beam/pull/10888
> >> [2] https://twister2.org/
> >> [3] https://github.com/DSC-SPIDAL/twister2
> >>
> >> Best Regards,
> >> Pulasthi
> >> --
> >> Pulasthi S. Wickramasinghe
> >> PhD Candidate  | Research Assistant
> >> School of Informatics and Computing | Digital Science Center
> >> Indiana University, Bloomington
> >> cell: 224-386-9035
> >
> >
> >
> > --
> > Pulasthi S. Wickramasinghe
> > PhD Candidate  | Research Assistant
> > School of Informatics and Computing | Digital Science Center
> > Indiana University, Bloomington
> > cell: 224-386-9035
>
>
>
> --
> Elliotte Rusty Harold
> elh...@ibiblio.org
>

Re: Contributing Twister2 runner to Apache Beam

2020-03-04 Thread Elliotte Rusty Harold

Generic question without commenting on Twister2 specifically:

Should runners, current and future, be in the same repository as Beam
core? Can or should they be completely separate products with their
own release cycles?

Generally, loose coupling leads to more maintainable, reliable
projects. Specifically, Cassandra is holding back some other changes
in Beam and I really wish it didn't have the ability to block
progress. The more different runners we have in core, the worse this
problem is likely to become.


On Wed, Mar 4, 2020 at 2:03 PM Pulasthi Supun Wickramasinghe
 wrote:
>
> Hi
>
> I believe the pull request is pretty complete now with the help of Ismaël. 
> Kenn, would you be able to take a look at it and suggest any changes if 
> needed?. The build checks and validations tests are passing at the moment.  I 
> will start working on the documentation that you mentioned in an earlier 
> email separately.
>
> Best Regards,
> Pulasthi
>
>
>
>
>
> On Tue, Feb 18, 2020 at 1:45 PM Pulasthi Supun Wickramasinghe 
>  wrote:
>>
>> Hi All,
>>
>> I have created the initial pull request [1] to contribute the Twister2 Beam 
>> runner to the Apache Beam codebase. More information on Twister2 can be 
>> found here[2] and the Twister2 codebase is available here[3]. At the moment 
>> only batch mode is supported in the runner, but we are planning to add 
>> stream support and implement a portable runner for Twister2 in the near 
>> future.
>>
>> As Kenn pointed out in an earlier email it would be great to have inputs 
>> from the community regarding this contribution since it is a sizable one. I 
>> am sure there are many improvements that can be done in the contributed 
>> codebase with input from the community.
>>
>> [1] https://github.com/apache/beam/pull/10888
>> [2] https://twister2.org/
>> [3] https://github.com/DSC-SPIDAL/twister2
>>
>> Best Regards,
>> Pulasthi
>> --
>> Pulasthi S. Wickramasinghe
>> PhD Candidate  | Research Assistant
>> School of Informatics and Computing | Digital Science Center
>> Indiana University, Bloomington
>> cell: 224-386-9035
>
>
>
> --
> Pulasthi S. Wickramasinghe
> PhD Candidate  | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> cell: 224-386-9035



-- 
Elliotte Rusty Harold
elh...@ibiblio.org

Re: Contributing Twister2 runner to Apache Beam

2020-03-04 Thread Pulasthi Supun Wickramasinghe

Hi

I believe the pull request is pretty complete now with the help of Ismaël.
Kenn, would you be able to take a look at it and suggest any changes if
needed? The build checks and validation tests are passing at the moment.
I will start working on the documentation that you mentioned in an earlier
email separately.

Best Regards,
Pulasthi





On Tue, Feb 18, 2020 at 1:45 PM Pulasthi Supun Wickramasinghe <
pulasthi...@gmail.com> wrote:

> Hi All,
>
> I have created the initial pull request [1] to contribute the Twister2
> Beam runner to the Apache Beam codebase. More information on Twister2 can
> be found here[2] and the Twister2 codebase is available here[3]. At the
> moment only batch mode is supported in the runner, but we are planning to
> add stream support and implement a portable runner for Twister2 in the near
> future.
>
> As Kenn pointed out in an earlier email it would be great to have inputs
> from the community regarding this contribution since it is a sizable one. I
> am sure there are many improvements that can be done in the contributed
> codebase with input from the community.
>
> [1] https://github.com/apache/beam/pull/10888
> [2] https://twister2.org/
> [3] https://github.com/DSC-SPIDAL/twister2
>
> Best Regards,
> Pulasthi
> --
> Pulasthi S. Wickramasinghe
> PhD Candidate  | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> cell: 224-386-9035
>


-- 
Pulasthi S. Wickramasinghe
PhD Candidate  | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
cell: 224-386-9035

Contributing Twister2 runner to Apache Beam

2020-02-18 Thread Pulasthi Supun Wickramasinghe

Hi All,

I have created the initial pull request [1] to contribute the Twister2 Beam
runner to the Apache Beam codebase. More information on Twister2 can be
found here [2], and the Twister2 codebase is available here [3]. At the moment
the runner supports only batch mode, but we are planning to add streaming
support and implement a portable runner for Twister2 in the near future.

As Kenn pointed out in an earlier email, it would be great to have input
from the community on this contribution, since it is a sizable one. I am
sure the contributed codebase can be improved in many ways with input from
the community.

[1] https://github.com/apache/beam/pull/10888
[2] https://twister2.org/
[3] https://github.com/DSC-SPIDAL/twister2

Best Regards,
Pulasthi
-- 
Pulasthi S. Wickramasinghe
PhD Candidate  | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
cell: 224-386-9035
