Re: [BEAM-5442] Store duplicate unknown (runner) options in a list argument

2018-10-12 Thread Henning Rohde
Agree that pipeline options lack some mechanism for scoping. It is also not
always possible to distinguish options meant to be consumed at pipeline
construction time, by the runner, by the SDK harness, by the user code, or
any combination -- and this causes confusion every now and then.

For Dataflow, we have been using "experiments" for arbitrary
runner-specific options. It's simply a string list pipeline option that all
SDKs support and, for Go at least, is sent to portable runners. Flink can
do the same in the short term to move forward.
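
As an illustration, a minimal sketch of that approach in the Python SDK --
the --experiments flag is a real string-list option, but the experiment
values below are made-up placeholders rather than real Dataflow or Flink
settings:

from apache_beam.options.pipeline_options import PipelineOptions

# Each --experiments flag appends one opaque string; a runner that
# recognizes the value can act on it, and other runners can ignore it.
options = PipelineOptions([
    '--runner=PortableRunner',
    '--experiments=enable_feature_x',           # hypothetical experiment
    '--experiments=custom_tuning=low_latency',  # hypothetical experiment
])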

Henning


On Fri, Oct 12, 2018 at 8:50 AM Thomas Weise  wrote:

> [moving to the list]
>
> The requirement driving this part of the change was to allow a user to
> specify pipeline options that a runner supports without having to declare
> those in each language SDK.
>
> In the specific scenario, we have options that the Flink runner supports
> (and can validate), that are not enumerated in the Python SDK.
>
> I think we have a bigger problem scoping pipeline options. For example,
> the runner options are dumped into the SDK worker. There is also a
> possibility of name collisions. So I think this would benefit from broader
> feedback.
>
> Thanks,
> Thomas
>
>
> ---------- Forwarded message ---------
> From: Charles Chen 
> Date: Fri, Oct 12, 2018 at 8:36 AM
> Subject: Re: [apache/beam] [BEAM-5442] Store duplicate unknown options in
> a list argument (#6600)
> To: apache/beam 
> Cc: Thomas Weise , Mention <
> ment...@noreply.github.com>
>
>
> CC: @tweise 
>


Re: Add cleanup flag to DockerPayload

2018-10-03 Thread Henning Rohde
IMO it's the runner's responsibility to do container garbage collection and
disk space management. This flag seems like an implementation-specific
option that would only apply to some runner/deployment combinations, so it
doesn't seem to belong in that proto. Dataflow would not be able to honor
such a flag, for example. Perhaps Flink could start all containers with
--rm and have a Flink-specific debug option to not do that?
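
A hedged sketch of that suggestion (the helper and flag names are made up
for illustration; only the docker CLI usage is real):

import subprocess

def start_sdk_harness(image, args, keep_for_debugging=False):
    cmd = ['docker', 'run', '-d']
    if not keep_for_debugging:
        # --rm makes Docker remove the container (and its writable layer,
        # where logs and tmp files accumulate) as soon as it exits.
        cmd.append('--rm')
    return subprocess.check_output(cmd + [image] + list(args)).decode().strip()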

Henning

On Wed, Oct 3, 2018 at 2:19 PM Ankur Goenka  wrote:

> Hi,
>
> In the portable Flink runner, SDK Harness docker containers are created
> dynamically and are not garbage collected. SDK Harness containers pull the
> staged artifacts and generate logs and tmp files, which are stored as an
> additional layer on top of the image.
> These dead container layers accumulate over time and make the machine run
> out of disk space. To avoid this situation while keeping the flexibility of
> letting containers exist for debugging, I am planning to add a cleanup flag
> to DockerPayload. When set, this flag will tell the runner to remove the
> container after killing it.
> Current DockerPayload:
>
> // The payload of a Docker image
> message DockerPayload {
>   string container_image = 1;  // implicitly linux_amd64.
> }
>
>
> Proposed DockerPayload:
>
> // The payload of a Docker image
> message DockerPayload {
>   string container_image = 1;  // implicitly linux_amd64.
>   bool cleanup = 2; // Flag to signal container deletion after killing.
> }
>
>
> Let me know your thoughts and if there is a better way to do this.
>
> Thanks,
> Ankur
>


Re: [ANNOUNCEMENT] New Beam chair: Kenneth Knowles

2018-09-20 Thread Henning Rohde
Thank you Davor for your leadership and contributions to the project from
the very beginning. And thank you Kenn for accepting the PMC chair role.

Henning

On Thu, Sep 20, 2018 at 11:21 AM Mikhail Gryzykhin 
wrote:

> Congratulations, Kenn!
>
> --Mikhail
>
>
>
> On Thu, Sep 20, 2018 at 11:19 AM Scott Wegner  wrote:
>
>> Congrats, Kenn!
>>
>> On Thu, Sep 20, 2018 at 10:46 AM Daniel Oliveira 
>> wrote:
>>
>>> Congrats Kenn! Sounds like you deserve it!
>>>
>>> On Thu, Sep 20, 2018 at 10:20 AM Udi Meiri  wrote:
>>>
 Congrats!

 On Thu, Sep 20, 2018 at 10:09 AM Raghu Angadi 
 wrote:

> Congrats Kenn!
>
> On Wed, Sep 19, 2018 at 12:54 PM Davor Bonaci 
> wrote:
>
>> Hi everyone --
>> It is with great pleasure that I announce that at today's meeting of
>> the Foundation's Board of Directors, the Board has appointed Kenneth
>> Knowles as the second chair of the Apache Beam project.
>>
>> Kenn has served on the PMC since its inception, and is very active
>> and effective in growing the community. His exemplary posts have been 
>> cited
>> in other projects. I'm super happy that Kenn accepted the nomination,
>> and I'm confident that he'll serve with distinction.
>>
>> As for myself, I'm not going anywhere. I'm still around and will be
>> as active as I have recently been. Thrilled to be able to pass the baton 
>> to
>> such a key member of this community and to have less administrative work 
>> to
>> do ;-).
>>
>> Please join me in welcoming Kenn to his new role, and I ask that you
>> support him as much as possible. As always, please let me know if you 
>> have
>> any questions.
>>
>> Davor
>>
>
>>
>> --
>> Got feedback? tinyurl.com/swegner-feedback
>>
>


Re: [VOTE] Donating the Dataflow Worker code to Apache Beam

2018-09-14 Thread Henning Rohde
+1

On Fri, Sep 14, 2018 at 2:40 PM Ahmet Altay  wrote:

> +1 (binding)
>
> On Fri, Sep 14, 2018 at 2:35 PM, Lukasz Cwik  wrote:
>
>> +1 (binding)
>>
>> On Fri, Sep 14, 2018 at 2:34 PM Pablo Estrada  wrote:
>>
>>> +1
>>>
>>> On Fri, Sep 14, 2018 at 2:32 PM Andrew Pilloud 
>>> wrote:
>>>
 +1

 On Fri, Sep 14, 2018 at 2:31 PM Lukasz Cwik  wrote:

> There was generally positive support and good feedback[1] but it was
> not unanimous. I wanted to bring the donation of the Dataflow worker code
> base to Apache Beam master to a vote.
>
> +1: Support having the Dataflow worker code as part of Apache Beam
> master branch
> -1: Dataflow worker code should live elsewhere
>
> 1:
> https://lists.apache.org/thread.html/89efd3bc1d30f3d43d4b361a5ee05bd52778c9dc3f43ac72354c2bd9@%3Cdev.beam.apache.org%3E
>

>


Re: PTransforms and Fusion

2018-09-11 Thread Henning Rohde
Empty pipelines have neither subtransforms nor a spec, which is what I don't
think is useful -- especially given the only use case (which is really
"nop") would create non-timer loops in the representations. I'd rather have
a well-known nop primitive instead. Even now, for the A example, I don't
think it's unreasonable to add a (well-known) identity transform inside a
normal composite to retain the extrema at either end. It could be ignored
at runtime at no cost.
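
A hedged Python sketch of that idea (the Mix composite comes from the
A, B -> C mixing example quoted below; the identity step is illustrative,
not a Beam-defined primitive):

import apache_beam as beam

class Mix(beam.PTransform):
    # Hypothetical A, B -> C composite that mixes two inputs.
    def __init__(self, fraction_a):
        self.fraction_a = fraction_a

    def expand(self, pcolls):
        a, b = pcolls
        if self.fraction_a >= 1.0:
            # Degenerate case: route A through an explicit identity so the
            # composite still contains a transform and is never "empty".
            return a | 'IdentityA' >> beam.Map(lambda x: x)
        if self.fraction_a <= 0.0:
            return b | 'IdentityB' >> beam.Map(lambda x: x)
        # General case, heavily simplified: merge both inputs.
        return (a, b) | beam.Flatten()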

To clarify my support for A1, native transforms would have a spec and would
be passed through in the shared code even though they're not primitives.


On Tue, Sep 11, 2018 at 12:56 AM Robert Bradshaw 
wrote:

> For (A), it really boils down to the question of what is a legal pipeline.
> A1 takes the position that all empty transforms must be on a whitelist
> (which implies B1, unless we make the whitelist extensible, which starts to
> sound a lot like B3). Presumably if we want to support B2, we cannot remove
> all empty unknown transforms, just those whose outputs are a subset of the
> inputs.
>
> The reason I strongly support A3 is that empty PTransforms are not just
> noise, they are expressions of user intent, and the pipeline graph should
> reflect that as faithfully as possible. This is the whole point of
> composite transforms--one should not be required to expose what is inside
> (even whether it's empty). Consider, for example, an A, B -> C transform
> that mixes A and B in proportions to produce C. In the degenerate case
> where we want 100% from A or 100% from B, it's reasonable to implement this
> by just returning A or B directly. But when, say, visualizing the pipeline
> graph, I don't think it's desirable to have the discontinuity of the
> composite transform suddenly disappearing when the mixing parameter is at
> either extreme.
>
> If a runner cannot handle these empty pipelines (as is the case for those
> relying on the current Java libraries) it is an easy matter for it to drop
> them, but that doesn't mean we should withhold this information (by making
> it illegal and dropping it in every SDK) from a runner (or any other tool)
> that would want to see this information.
>
> - Robert
>
>
> On Tue, Sep 11, 2018 at 4:20 AM Henning Rohde  wrote:
>
>> For A, I am in favor of A1 and A2 as well. It is then up to each SDK to
>> not generate "empty" transforms in the proto representation as we avoid
>> noise as mentioned. The shared Java libraries are also optional and we
>> should not assume every runner will use them. I'm not convinced empty
>> transforms would have value for pipeline structure over what can be
>> accomplished with normal composites. I suspect empty transforms, such as A,
>> B -> B, B, will just be confusion generators.
>>
>> For B, I favor B2 for the reasons Thomas mentions. I also agree with the
>> -1 for B1.
>>
>> On Mon, Sep 10, 2018 at 2:51 PM Thomas Weise  wrote:
>>
>>> For B, note the prior discussion [1].
>>>
>>> B1 and B2 cannot be supported at the same time.
>>>
>>> Native transforms will almost always be customizations. Users do not
>>> create customizations without reason. They would start with what is there
>>> and dig deeper only when needed. Right now there are no streaming
>>> connectors in the Python SDK - should the user not use the SDK? Or is it
>>> better (now and in general) to have the option of a custom connector, even
>>> when it is not portable?
>>>
>>> Regarding portability, IMO it should be up to the user to decide how
>>> much of it is necessary/important. The IO requirements are normally
>>> dictated by the infrastructure. If it has Kafka, Kinesis or any other
>>> source (including those that Beam might never have a connector for), the
>>> user needs to be able to integrate it.
>>>
>>> Overall extensibility is important and will help users adopt Beam. This
>>> has come up in a few other areas (think Docker environments). I think we
>>> need to provide the flexibility and enable, not prevent alternatives and
>>> therefore -1 for B1 (unsurprisingly :).
>>>
>>> [1]
>>> https://lists.apache.org/thread.html/9813ee10cb1cd9bf64e1c4f04c02b606c7b12d733f4505fb62f4a954@%3Cdev.beam.apache.org%3E
>>>
>>>
>>> On Mon, Sep 10, 2018 at 10:14 AM Robert Bradshaw 
>>> wrote:
>>>
>>>> A) I think it's a bug to not handle empty PTransforms (which are useful
>>>> at pipeline construction, and may still have meaning in terms of pipeline
>>>> structure, e.g. for visualization). Note that such transforms, if truly
>>>>

Re: PTransforms and Fusion

2018-09-10 Thread Henning Rohde
For A, I am in favor of A1 and A2 as well. It is then up to each SDK to not
generate "empty" transforms in the proto representation as we avoid noise
as mentioned. The shared Java libraries are also optional and we should not
assume every runner will use them. I'm not convinced empty transforms would
have value for pipeline structure over what can be accomplished with normal
composites. I suspect empty transforms, such as A, B -> B, B, will just be
confusion generators.

For B, I favor B2 for the reasons Thomas mentions. I also agree with the -1
for B1.

On Mon, Sep 10, 2018 at 2:51 PM Thomas Weise  wrote:

> For B, note the prior discussion [1].
>
> B1 and B2 cannot be supported at the same time.
>
> Native transforms will almost always be customizations. Users do not
> create customizations without reason. They would start with what is there
> and dig deeper only when needed. Right now there are no streaming
> connectors in the Python SDK - should the user not use the SDK? Or is it
> better (now and in general) to have the option of a custom connector, even
> when it is not portable?
>
> Regarding portability, IMO it should be up to the user to decide how much
> of it is necessary/important. The IO requirements are normally dictated by
> the infrastructure. If it has Kafka, Kinesis or any other source (including
> those that Beam might never have a connector for), the user needs to be
> able to integrate it.
>
> Overall extensibility is important and will help users adopt Beam. This
> has come up in a few other areas (think Docker environments). I think we
> need to provide the flexibility and enable, not prevent alternatives and
> therefore -1 for B1 (unsurprisingly :).
>
> [1]
> https://lists.apache.org/thread.html/9813ee10cb1cd9bf64e1c4f04c02b606c7b12d733f4505fb62f4a954@%3Cdev.beam.apache.org%3E
>
>
> On Mon, Sep 10, 2018 at 10:14 AM Robert Bradshaw 
> wrote:
>
>> A) I think it's a bug to not handle empty PTransforms (which are useful
>> at pipeline construction, and may still have meaning in terms of pipeline
>> structure, e.g. for visualization). Note that such transforms, if truly
>> composite, can't output any PCollections that do not appear in their inputs
>> (which is how we distinguish them from primitives in Python). Thus I'm in
>> favor of A3, and as a stopgap we can drop these transforms as part of/just
>> before decoding in the Java libraries (rather than in the SDKs during
>> encoding as in A2).
>>
>> B) I'm also for B1 or B2.
>>
>>
>> On Mon, Sep 10, 2018 at 3:58 PM Maximilian Michels 
>> wrote:
>>
>>> > A) What should we do with these "empty" PTransforms?
>>>
>>> We can't translate them, so dropping them seems the most reasonable
>>> choice. Should we throw an error/warning to make the user aware of this?
>>> Otherwise it might be unexpected for the user.
>>>
>>> >> A3) Handle the "empty" PTransform case within all of the shared
>>> libraries.
>>>
>>> What can we do at this point other than dropping them?
>>>
>>> > B) What should we do with "native" PTransforms?
>>>
>>> I support B1 and B2 as well. Non-portable PTransforms should be
>>> discouraged in the long run. However, the available PTransforms are not
>>> even consistent across the different SDKs yet (e.g. no streaming
>>> connectors in Python), so we should continue to provide a way to utilize
>>> the "native" transforms of a Runner.
>>>
>>> -Max
>>>
>>> On 07.09.18 19:15, Lukasz Cwik wrote:
>>> > A primitive transform is a PTransform that has been chosen to have no
>>> > default implementation in terms of other PTransforms. A primitive
>>> > transform therefore must be implemented directly by a pipeline runner
>>> in
>>> > terms of pipeline-runner-specific concepts. An initial list of
>>> primitive
>>> > PTransforms was defined in [2] and has since been updated in [3].
>>> >
>>> > As part of the portability effort, libraries that are intended to be
>>> > shared across multiple runners are being developed to support their
>>> > migration to a portable execution model. One of these is responsible
>>> for
>>> > fusing multiple primitive PTransforms together into a pipeline runner
>>> > specific concept. This library made the choice that a primitive
>>> > PTransform is a PTransform that doesn't contain any other PTransforms.
>>> >
>>> > Unfortunately, while Ryan was attempting to enable testing of
>>> validates
>>> > runner tests for Flink using the new portability libraries, he ran
>>> into
>>> > an issue where the Apache Beam Java SDK allows for a person to
>>> construct
>>> > a PTransform that has zero sub PTransforms and also isn't one of the
>>> > defined Apache Beam primitives. In this case the PTransform was
>>> trivial
>>> > as it was not applying any additional transforms to input PCollection
>>> > and just returning it. This caused an issue within the portability
>>> > libraries since they couldn't handle this structure.
>>> >
>>> > To solve this issue, I had proposed that we modify the portability
>>> > library that does 

Re: [Proposal] Creating a reproducible environment for Beam Jenkins Tests

2018-09-10 Thread Henning Rohde
+1 Nice proposal. It will help eradicate some of the inflexibility and
frustrations with Jenkins.

On Wed, Sep 5, 2018 at 2:30 PM Yifan Zou  wrote:

> Thank you all for making comments on this and I apologize for the late
> reply.
>
> To clarify the concerns about testing locally, it is still possible to run
> tests without Docker. One of the purposes of this is to create an environment
> identical to the one we run in Jenkins, which would be helpful for
> reproducing strange errors. Contributors could choose to start a container
> and run tests in there, or just run tests directly.
>
>
>
> On Wed, Sep 5, 2018 at 6:37 AM Ismaël Mejía  wrote:
>
>> BIG +1, the previous work on having docker build images [1] had a
>> similar goal (to have a reproducible build environment). But this is
>> even better because we will guarantee the exact same environment in
>> Jenkins as well as any further improvements. It is important to
>> document the setup process as part of this (for future maintenance +
>> local reproducibility).
>>
>> Just for clarification, this is independent of running the tests
>> locally without Docker; it is more about improving the reproducibility of
>> the environment we have on Jenkins locally, for example to address some
>> weird Heisenbug.
>>
>> I just added BEAM-5311 to track the removal of the docker build images
>> when this is ready (of course if there are no objections to this
>> proposal).
>>
>> [1] https://beam.apache.org/contribute/docker-images/
>> On Thu, Aug 30, 2018 at 3:58 PM Jean-Baptiste Onofré 
>> wrote:
>> >
>> > Hi,
>> >
>> > That's interesting, however, it's really important to still be able to
>> > easily run tests locally, without any VM/Docker required. It should be
>> > activated by profile or so.
>> >
>> > Regards
>> > JB
>> >
>> > On 27/08/2018 19:53, Yifan Zou wrote:
>> > > Hi,
>> > >
>> > > I have a proposal for creating a reproducible environment for Jenkins
>> > > tests by using docker container. The thing is, the environment
>> > > configurations on Beam Jenkins slaves are sometimes different from
>> > > developer's machines. Test failures on Jenkins may not be easy to
>> > > reproduce locally. Also, it is not convenient for developers to add or
>> > > modify underlying tools installed on Jenkins VMs, since they're
>> managed
>> > > by Apache Infra. This proposal is aimed to address those problems.
>> > >
>> > >
>> https://docs.google.com/document/d/1y0YuQj_oZXC0uM5-gniG7r9-5gv2uiDhzbtgYYJW48c/edit#heading=h.bg2yi0wbhl9n
>> > >
>> > > Any comments are welcome. Thank you.
>> > >
>> > > Regards.
>> > > Yifan
>> > >
>> >
>> > --
>> > Jean-Baptiste Onofré
>> > jbono...@apache.org
>> > http://blog.nanthrax.net
>> > Talend - http://www.talend.com
>>
>


Re: Enhancing Environment Proto to support Docker and Process Environments.

2018-08-31 Thread Henning Rohde
Hi Ankur,

There is a separate related thread: "Process JobBundleFactory for portable
runner". It also contains a concrete suggestion for Environment proto
changes to accommodate docker and process based on a urn/payload structure
and a but of discussion.

We should either continue the discussion there or bring the two proposals
together here. What do you think?

Thanks,
 Henning

On Fri, Aug 31, 2018 at 5:47 PM Ankur Goenka  wrote:

> Hi,
>
> We recently added the ProcessEnvironment, which uses a forked process
> instead of a Docker container to run the SDKHarness.
> But the current Environment proto does not have a well-defined structure
> to represent a process.
> Current Proto:
>
> message Environment {
>
>   // (Required) The URL of a container
>   //
>   // TODO: reconcile with Fn API's DockerContainer structure by
>   // adding adequate metadata to know how to interpret the container
>   string url = 1;
> }
>
> I am planning to enhance the proto to support different types of
> environments (Docker and Process for now).
> Here is the new proposed proto.
>
> message Environment {
>
>   message Docker {
> // (Required) URL for the container image.
> string url = 1;
> // Arguments to the container.
> repeated string arguments = 2;
> // Environment variable for the container.
> repeated string environments = 3;
>   }
>
>   message Process {
> // (Required) Name of the executable.
> string executable = 1;
> // Arguments to the process.
> repeated string arguments = 2;
> // Environment variable for the process.
> repeated string environments = 3;
>   }
>
>   // (Required) The unique id of the environment.
>   string id = 1;
>
>   // (Required) Environment configuration.
>   oneof config {
> Docker docker_config = 2;
> Process process_config = 3;
>   }
> }
>
> Please let me know your thoughts.
>
> Thanks,
> Ankur
>
>
>


Re: Process JobBundleFactory for portable runner

2018-08-24 Thread Henning Rohde
> Do we expect pipelines to always have a single environment for each
PTransform, thus the SDK is dictating how it is launched/managed or do we
expect for each SDK to say here are a couple of ways to run me, letting the
runner decide?

Good point. I think we should allow multiple environments in the
representation -- even if it's always a singleton, the cost is only a small
loss in readability. A docker and embedded combo might be a useful default
for Java, for example. The staged files needed might be different for
different environments, so there are some sharp edges. However, an SDK can
always send just one environment to avoid those.

One idea: we should perhaps allow multiple staged artifact manifests for a
job and add an artifact manifest id to the env (as opposed to having
multi-env pipelines organize the artifacts internally to make sense of
them). Artifacts are global to the pipeline right now, but it might be
easier to manage if they were per environment.

> What does providing the target os/arch provide in beam:env:process:v1?

It would allow supporting mixed linux/windows workers, for example, as well
as allow the runner to reject unsupported process environments upfront as a
sanity check. I'm thinking the latter would be convenient to catch user
mistakes with local/remote runners of different architecture.
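
A minimal sketch of such an upfront check (the function and the os/arch
normalization are assumptions for illustration, not Beam code):

import platform

def validate_process_env(env_os, env_arch):
    host_os = platform.system().lower()    # e.g. 'linux', 'darwin'
    machine = platform.machine().lower()   # e.g. 'x86_64', 'amd64'
    host_arch = 'amd64' if machine in ('x86_64', 'amd64') else machine
    if (env_os, env_arch) != (host_os, host_arch):
        # Reject at submission time rather than failing at bundle execution.
        raise ValueError('process env %s/%s not supported on %s/%s workers'
                         % (env_os, env_arch, host_os, host_arch))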

Henning

On Fri, Aug 24, 2018 at 1:22 PM Lukasz Cwik  wrote:

> Do we expect pipelines to always have a single environment for each
> PTransform, thus the SDK is dictating how it is launched/managed or do we
> expect for each SDK to say here are a couple of ways to run me, letting the
> runner decide?
>
> What does providing the target os/arch provide in beam:env:process:v1?
> I would suspect that the user must have supplied a script compatible with
> the cluster being run on. The user knows upfront what is being invoked.
> Note that launching a binary with the current container contract allows the
> binary to do plenty (activate virtualenv, install libraries in some SDK
> specific way oblivious to the runner, ...) so keeping it limited to a
> binary with some args being passed in required by the container contract
> seems to work well.
>
> +1 for making usage of URNs and payloads.
>
> On Fri, Aug 24, 2018 at 1:32 AM Robert Bradshaw 
> wrote:
>
>> I think "external" still needs some way (I was suggesting grpc) to
>> pass the control address, etc. to whatever starts up the workers.
>>
>> Also, +1 to making this a URN. Embedded makes sense too.
>> On Fri, Aug 24, 2018 at 6:00 AM Thomas Weise  wrote:
>> >
>> > Option #3 "external" would fit the Kubernetes use case we discussed a
>> while ago also. Container(s) can be part of the same pod and need to find
>> the runner.
>> >
>> > There is another option: "embedded". When the SDK is Java and the
>> runner Flink (or all the other OSS runners), then harness can (optionally)
>> run embedded in the same JVM.
>> >
>> > Thanks,
>> > Thomas
>> >
>> >
>> > On Thu, Aug 23, 2018 at 9:14 AM Henning Rohde 
>> wrote:
>> >>
>> >> A process-based SDK harness does not IMO imply that the host is fully
>> provisioned by the SDK/user, and invoking the user command line in the
>> context of the staged files is a critical aspect for it to work. So I
>> consider staged artifact support needed. Also, I would like to suggest that
>> we move to a concrete environment proto to crystalize what is actually
>> being proposed. I'm not sure what activating a virtualenv would look like,
>> for example. To start things off:
>> >>
>> >> message Environment {
>> >>   string urn = 1;
>> >>   bytes payload = 2;
>> >> }
>> >>
>> >> // urn == "beam:env:docker:v1"
>> >> message DockerPayload {
>> >>   string container_image = 1;  // implicitly linux_amd64.
>> >> }
>> >>
>> >> // urn == "beam:env:process:v1"
>> >> message ProcessPayload {
>> >>   string os = 1;  // "linux", "darwin", ..
>> >>   string arch = 2;  // "amd64", ..
>> >>   string command_line = 3;
>> >> }
>> >>
>> >> // urn == "beam:env:external:v1"
>> >> // (no payload)
>> >>
>> >> A runner may support any subset and reject any unsupported
>> configuration. There are 3 kinds of environments that I think are useful:
>> >>  (1) docker: works as currently. Offers the most flexibility for SDKs
>> and users, especially when the runner is outside the control (such as

Re: [Proposal] Track non-code contributions in Jira

2018-08-24 Thread Henning Rohde
+1

On Fri, Aug 24, 2018 at 11:44 AM Rose Nguyen  wrote:

> +1 Great idea
>
> On Fri, Aug 24, 2018 at 10:01 AM Mikhail Gryzykhin 
> wrote:
>
>> +1. Idea sounds great.
>>
>> --Mikhail
>>
>> Have feedback ?
>>
>>
>> On Fri, Aug 24, 2018 at 7:19 AM Maximilian Michels 
>> wrote:
>>
>>> +1 Code is just one part of a successful open-source project. As long as
>>> the tasks are properly labelled and actionable, I think it works to put
>>> them into JIRA.
>>>
>>> On 24.08.18 15:09, Matthias Baetens wrote:
>>> >
>>> > I fully agree and think it is a great idea.
>>> >
>>> > I think that, next to visibility and keeping track of everything that
>>> is
>>> > going on in the community, the other goal would be documenting best
>>> > practices for future use.
>>> >
>>> > I am also not sure, though, if JIRA is the best place to do so, as
>>> > Austin raised.
>>> > Introducing (yet) another tool on the other hand, might also not be
>>> > ideal. Has anyone else experience with this from other Apache projects?
>>> >
>>> > On Fri, 24 Aug 2018 at 06:04 Austin Bennett <
>>> whatwouldausti...@gmail.com
>>> > > wrote:
>>> >
>>> > Certainly tracking and managing these are important -- though, is
>>> > Jira the best tool for these things?
>>> >
>>> > I do see it useful to put in Jira tickets for my director to
>>> have
>>> > conversations on specific topics with people, for consensus
>>> > building, etc etc.  So, I have seen it work even for non-coding
>>> tasks.
>>> >
>>> > It seems like much of #s 2-6 mentioned requires project management
>>> > applied to those specific domains and is applicable elsewhere,
>>> > wondering what constitutes "pure" project management in #1 (as it
>>> > applies here)...?  In that light I'm just getting picky about
>>> > taxonomy :-)
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > On Thu, Aug 23, 2018 at 3:10 PM Alan Myrvold >> > > wrote:
>>> >
>>> > I like the idea of recognizing non-code contributions. These
>>> > other efforts have been very helpful.
>>> >
>>> > On Thu, Aug 23, 2018 at 3:07 PM Griselda Cuevas <
>>> g...@google.com
>>> > > wrote:
>>> >
>>> > Hi Beam Community,
>>> >
>>> > I'd like to start tracking non-code contributions for Beam,
>>> > especially around these six categories:
>>> > 1) Project Management
>>> > 2) Community Management
>>> > 3) Advocacy
>>> > 4) Events & Meetups
>>> > 5) Documentation
>>> > 6) Training
>>> >
>>> > The proposal would be to create six boards in Jira, one per
>>> > proposed category, and as part of this initiative also
>>> clean
>>> > the already existing "Project Management" component, i.e.
>>> > making sure all issues there are still relevant.
>>> >
>>> > After this, I'd also create a landing page in the website
>>> > that talks about all types of contributions to the project.
>>> >
>>> > The reason for doing this is mainly to give visibility to
>>> > some of the great work our community does beyond code
>>> pushes
>>> > in Github. Initiatives around Beam are starting to spark
>>> > around the world, and it'd be great to become an Apache
>>> > project recognized for its outstanding community.
>>> >
>>> > What are your thoughts?
>>> > G
>>> >
>>> > Gris
>>> >
>>> > --
>>>
>>> --
>>> Max
>>>
>>
>
> --
>
>
> Rose Thi Nguyen
>
>   Technical Writer
>
> (281) 683-6900
>


Re: Process JobBundleFactory for portable runner

2018-08-23 Thread Henning Rohde
 to us too.
>> >> On the same note, we are trying out portable_runner.py to submit a
>> >> python job. Seems it will create a default docker url even if the
>> >> harness_docker_image is set to None in pipeline options. Shall we add
>> >> another option or honor the None in this option to support the process
>> >> job? I made some local changes right now to work around this.
>> >>
>> >> Thanks,
>> >> Xinyu
>> >>
>> >> On Tue, Aug 21, 2018 at 12:25 PM, Henning Rohde > >> <mailto:hero...@google.com>> wrote:
>> >>
>> >> By "enum" in quotes, I meant the usual open URN-style pattern, not an
>> >> actual enum. Sorry if that wasn't clear.
>> >>
>> >> On Tue, Aug 21, 2018 at 11:51 AM Lukasz Cwik > >> <mailto:lc...@google.com>> wrote:
>> >>
>> >> I would model the environment to be more free-form than enums
>> >> such that we have forward-looking extensibility and would
>> >> suggest to follow the same pattern we use on PTransforms (using
>> >> an URN and a URN specific payload). Note that in this case we
>> >> may want to support a list of supported environments (e.g.
>> java,
>> >> docker, python, ...).
>> >>
>> >> On Tue, Aug 21, 2018 at 10:37 AM Henning Rohde
>> >> mailto:hero...@google.com>> wrote:
>> >>
>> >> One thing to consider that we've talked about in the past.
>> >> It might make sense to extend the environment proto and
>> have
>> >> the SDK be explicit about which kinds of environment it
>> >> supports:
>> >>
>> >>
>> >>
>> https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969
>> >>
>> >>
>>
>> >>
>> >>
>> >> This choice might impact what files are staged or what not.
>> >> Some SDKs, such as Go, provide a compiled binary and _need_
>> >> to know what the target architecture is. Running on a mac
>> >> with and without docker, say, requires a different worker
>> in
>> >> each case. If we add an "enum", we can also easily add the
>> >> external idea where the SDK/user starts the SDK harnesses
>> >> instead of the runner. Each runner may not support all
>> types
>> >> of environments.
>> >>
>> >> Henning
>> >>
>> >> On Tue, Aug 21, 2018 at 2:52 AM Maximilian Michels
>> >> mailto:m...@apache.org>> wrote:
>> >>
>> >> For reference, here is corresponding JIRA issue for
>> this
>> >> thread:
>> >> https://issues.apache.org/jira/browse/BEAM-5187
>> >>
>> >> On 16.08.18 11:15, Maximilian Michels wrote:
>> >>  > Makes sense to have an option to run the SDK harness
>> >> in a non-dockerized
>> >>  > environment.
>> >>  >
>> >>  > I'm in the process of creating a Docker entry point
>> >> for Flink's
>> >>  > JobServer[1]. I suppose you would also prefer to
>> >> execute that one
>> >>  > standalone. We should make sure this is also an
>> >> option.
>> >>  >
>> >>  > [1] https://issues.apache.org/jira/browse/BEAM-4130
>> >>  >
>> >>  > On 16.08.18 07:42, Thomas Weise wrote:
>> >>  >> Yes, that's the proposal. Everything that would
>> >> otherwise be packaged
>> >>  >> into the Docker container would need to be pre-installed in the host environment.

Re: Discussion: Scheduling across runner and SDKHarness in Portability framework

2018-08-22 Thread Henning Rohde
Sending bundles that cannot be executed, i.e., the situation with mapB
described at the beginning of the thread as causing deadlock in Flink. The
discussion of exposing (or assuming an infinitely large) concurrency level
-- while a useful concept in its own right -- came about as a way to
unblock mapB.

On Tue, Aug 21, 2018 at 2:16 PM Lukasz Cwik  wrote:

> Henning, can you clarify what you mean by sending non-executable bundles
> to the SDK harness and how it is useful for Flink?
>
> On Tue, Aug 21, 2018 at 2:01 PM Henning Rohde  wrote:
>
>> I think it will be useful to the runner to know upfront what the
>> fundamental threading capabilities are for the SDK harness (say, "fixed",
>> "linear", "dynamic", ..) so that the runner can upfront make a good static
>> decision on #harnesses and how many resources they should each have. It's
>> wasteful to give the Foo SDK a whole many-core machine with TBs of memory,
>> if it can only support a single bundle at a time. I think this is also in
>> line with what Thomas and Luke are suggesting.
>>
>> However, it still seems to me to be a semantically problematic idea to
>> send non-executable bundles to the SDK harness. I understand it's useful
>> for Flink, but is that really the best path forward?
>>
>>
>>
>> On Mon, Aug 20, 2018 at 5:44 PM Ankur Goenka  wrote:
>>
>>> That's right.
>>> To add to it: we added multi-threading to Python streaming since a single
>>> thread is suboptimal for the streaming use case.
>>> Shall we move towards a conclusion on the SDK bundle processing upper
>>> bound?
>>>
>>> On Mon, Aug 20, 2018 at 1:54 PM Lukasz Cwik  wrote:
>>>
>>>> Ankur, I can see where you are going with your argument. I believe
>>>> there is certain information which is static and won't change at pipeline
>>>> creation time (such as Python SDK is most efficient doing one bundle at a
>>>> time) and some stuff which is best at runtime, like memory and CPU limits,
>>>> worker count.
>>>>
>>>> On Mon, Aug 20, 2018 at 1:47 PM Ankur Goenka  wrote:
>>>>
>>>>> I would prefer to keep it dynamic, as it can be changed by the
>>>>> infrastructure or the pipeline author.
>>>>> In the case of Python, the number of concurrent bundles can be changed by
>>>>> setting the pipeline option worker_count. And for Java it can be computed
>>>>> based on the CPUs on the machine.
>>>>>
>>>>> For Flink runner, we can use the worker_count parameter for now to
>>>>> increase the parallelism. And we can have 1 container for each
>>>>> mapPartition task on Flink while reusing containers, as container creation
>>>>> is expensive, especially for Python where it installs a bunch of
>>>>> dependencies. There is one caveat though: I have seen machine crashes
>>>>> because of too many simultaneous container creations. We can rate limit
>>>>> container creation in the code to avoid this.
>>>>>
>>>>> Thanks,
>>>>> Ankur
>>>>>
>>>>> On Mon, Aug 20, 2018 at 9:20 AM Lukasz Cwik  wrote:
>>>>>
>>>>>> +1 on making the resources part of a proto. Based upon what Henning
>>>>>> linked to, the provisioning API seems like an appropriate place to 
>>>>>> provide
>>>>>> this information.
>>>>>>
>>>>>> Thomas, I believe the environment proto is the best place to add
>>>>>> information that a runner may want to know about upfront during
>>>>>> pipeline creation. I wouldn't stick this into PipelineOptions for the
>>>>>> long term.
>>>>>> If you have time to capture these thoughts and update the environment
>>>>>> proto, I would suggest going down that path. Otherwise anything short 
>>>>>> term
>>>>>> like PipelineOptions will do.
>>>>>>
>>>>>> On Sun, Aug 19, 2018 at 5:41 PM Thomas Weise  wrote:
>>>>>>
>>>>>>> For SDKs where the upper limit is constant and known upfront, why
>>>>>>> not communicate this along with the other harness resource info as part 
>>>>>>> of
>>>>>>> the job submission?
>>>>>>>
>>>>>>> Regarding use 

Re: Bootstrapping Beam's Job Server

2018-08-21 Thread Henning Rohde
Agree with Luke. Perhaps something simple, prescriptive yet flexible, such
as a custom command line (defined in the environment proto), rooted at the
base of the provided artifacts, and either passed the same arguments as
defined in the container contract or made available through substitution.
That way, all the restrictions/assumptions of the execution environment
become implicit and runner/deployment dependent.
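
A hedged sketch of the substitution idea (the placeholder names and helper
are assumptions for illustration, not part of any Beam contract):

import subprocess

def launch_harness(template, staging_dir, control_address, worker_id):
    # e.g. './run_harness.sh --control={control_address} --id={worker_id}'
    cmd = template.format(control_address=control_address,
                          worker_id=worker_id)
    # Run the user's command line rooted at the base of the staged artifacts.
    return subprocess.Popen(cmd, shell=True, cwd=staging_dir)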


On Tue, Aug 21, 2018 at 2:12 PM Lukasz Cwik  wrote:

> I believe supporting a simple Process environment makes sense. It would be
> best if we didn't make the Process route solve all the problems that Docker
> solves for us. In my opinion we should limit the Process route to assume
> that the execution environment:
> * has all dependencies and libraries installed
> * is of a compatible machine architecture
> * doesn't require special networking rules to be setup
>
> Any other suggestions for reasonable limits on a Process environment?
>
> On Tue, Aug 21, 2018 at 2:53 AM Ismaël Mejía  wrote:
>
>> It is also worth mentioning that apart from the testing/development use
>> case there is also the case of supporting people running in Hadoop
>> distributions. There are two extra reasons to want a process based
>> version: (1) Some Hadoop distributions run in machines with really old
>> kernels where docker support is limited or nonexistent (yes, some of
>> those run on kernel 2.6!) and (2) Ops people may be wary of the
>> additional operational overhead of enabling docker in their clusters.
>> On Tue, Aug 21, 2018 at 11:50 AM Maximilian Michels 
>> wrote:
>> >
>> > Thanks Henning and Thomas. It looks like
>> >
>> > a) we want to keep the Docker Job Server Docker container and rely on
>> > spinning up "sibling" SDK harness containers via the Docker socket. This
>> > should require little changes to the Runner code.
>> >
>> > b) have the InProcess SDK harness as an alternative way to running user
>> > code. This can be done independently of a).
>> >
>> > Thomas, let's sync today on the InProcess SDK harness. I've created a
>> > JIRA issue: https://issues.apache.org/jira/browse/BEAM-5187
>> >
>> > Cheers,
>> > Max
>> >
>> > On 21.08.18 00:35, Thomas Weise wrote:
>> > > The original objective was to make test/development easier (which I
>> > > think is super important for user experience with portable runner).
>> > >
>> > >  From first hand experience I can confirm that dealing with Flink
>> > > clusters and Docker containers for local setup is a significant hurdle
>> > > for Python developers.
>> > >
>> > > To simplify using Flink in embedded mode, the (direct) process based
>> SDK
>> > > harness would be a good option, especially when it can be linked to
>> the
>> > > same virtualenv that developers have already setup, eliminating extra
>> > > packaging/deployment steps.
>> > >
>> > > Max, I would be interested to sync up on what your thoughts are
>> > > regarding that option since you mention you also started to work on it
>> > > (see previous discussion [1], not sure if there is a JIRA for it yet).
>> > > Internally we are planning to use a direct SDK harness process instead
>> > > of Docker containers. For our specific needs it will works equally
>> well
>> > > for development and production, including future plans to deploy Flink
>> > > TMs via Kubernetes.
>> > >
>> > > Thanks,
>> > > Thomas
>> > >
>> > > [1]
>> > >
>> https://lists.apache.org/thread.html/d8b81e9f74f77d74c8b883cda80fa48efdcaf6ac2ad313c4fe68795a@%3Cdev.beam.apache.org%3E
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > On Mon, Aug 20, 2018 at 3:00 PM Maximilian Michels > > > > wrote:
>> > >
>> > > Thanks for your suggestions. Please see below.
>> > >
>> > >  > Option 3) would be to map in the docker binary and socket to
>> allow
>> > >  > the containerized Flink job server to start "sibling"
>> containers on
>> > >  > the host.
>> > >
>> > > Do you mean packaging Docker inside the Job Server container and
>> > > mounting /var/run/docker.sock from the host inside the container?
>> That
>> > > looks like a bit of a hack but for testing it could be fine.
>> > >
>> > >  > notably, if the runner supports auto-scaling or similar
>> non-trivial
>> > >  > configurations, that would be difficult to manage from the SDK
>> side.
>> > >
>> > > You're right, it would be unfortunate if the SDK would have to
>> deal with
>> > > spinning up SDK harness/backend containers. For non-trivial
>> > > configurations it would probably require an extended protocol.
>> > >
>> > >  > Option 4) We are also thinking about adding process based
>> SDKHarness.
>> > >  > This will avoid docker in docker scenario.
>> > >
>> > > Actually, I had started implementing a process-based SDK harness
>> but
>> > > figured it might be impractical because it doubles the execution
>> path
>> > > for UDF code and potentially doesn't work with custom
>> dependencies.
>> > >
>> > >  > 

Re: Discussion: Scheduling across runner and SDKHarness in Portability framework

2018-08-21 Thread Henning Rohde
I think it will be useful to the runner to know upfront what the
fundamental threading capabilities are for the SDK harness (say, "fixed",
"linear", "dynamic", ..) so that the runner can upfront make a good static
decision on #harnesses and how many resources they should each have. It's
wasteful to give the Foo SDK a whole many-core machine with TBs of memory,
if it can only support a single bundle at a time. I think this is also in
line with what Thomas and Luke are suggesting.
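
A hedged sketch of that static decision (the capability names are the ones
suggested above; the sizing policy itself is purely illustrative):

def plan_harnesses(capability, machine_cores, machine_mem_gb):
    if capability == 'fixed':
        # One bundle at a time: prefer many small harnesses per machine.
        n = machine_cores
    else:
        # 'linear' or 'dynamic': one harness can use the whole machine.
        n = 1
    return n, (machine_cores // n, machine_mem_gb // n)

# e.g. a 16-core, 64 GB machine with a 'fixed' SDK:
# 16 harnesses with 1 core and 4 GB each.
print(plan_harnesses('fixed', 16, 64))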

However, it still seems to me to be a semantically problematic idea to send
non-executable bundles to the SDK harness. I understand it's useful for
Flink, but is that really the best path forward?



On Mon, Aug 20, 2018 at 5:44 PM Ankur Goenka  wrote:

> That's right.
> To add to it: we added multi-threading to Python streaming since a single
> thread is suboptimal for the streaming use case.
> Shall we move towards a conclusion on the SDK bundle processing upper
> bound?
>
> On Mon, Aug 20, 2018 at 1:54 PM Lukasz Cwik  wrote:
>
>> Ankur, I can see where you are going with your argument. I believe there
>> is certain information which is static and won't change at pipeline
>> creation time (such as Python SDK is most efficient doing one bundle at a
>> time) and some stuff which is best at runtime, like memory and CPU limits,
>> worker count.
>>
>> On Mon, Aug 20, 2018 at 1:47 PM Ankur Goenka  wrote:
>>
>>> I would prefer to keep it dynamic, as it can be changed by the
>>> infrastructure or the pipeline author.
>>> In the case of Python, the number of concurrent bundles can be changed by
>>> setting the pipeline option worker_count. And for Java it can be computed
>>> based on the CPUs on the machine.
>>>
>>> For Flink runner, we can use the worker_count parameter for now to
>>> increase the parallelism. And we can have 1 container for each mapPartition
>>> task on Flink while reusing containers as container creation is expensive
>>> especially for Python where it installs a bunch of dependencies. There is one
>>> caveat though: I have seen machine crashes because of too many simultaneous
>>> container creations. We can rate limit container creation in the code to
>>> avoid this.
>>>
>>> Thanks,
>>> Ankur
>>>
>>> On Mon, Aug 20, 2018 at 9:20 AM Lukasz Cwik  wrote:
>>>
>>>> +1 on making the resources part of a proto. Based upon what Henning
>>>> linked to, the provisioning API seems like an appropriate place to provide
>>>> this information.
>>>>
>>>> Thomas, I believe the environment proto is the best place to add
>>>> information that a runner may want to know about upfront during
>>>> pipeline creation. I wouldn't stick this into PipelineOptions for the long
>>>> term.
>>>> If you have time to capture these thoughts and update the environment
>>>> proto, I would suggest going down that path. Otherwise anything short term
>>>> like PipelineOptions will do.
>>>>
>>>> On Sun, Aug 19, 2018 at 5:41 PM Thomas Weise  wrote:
>>>>
>>>>> For SDKs where the upper limit is constant and known upfront, why not
>>>>> communicate this along with the other harness resource info as part of the
>>>>> job submission?
>>>>>
>>>>> Regarding use of GRPC headers: Why not make this explicit in the proto
>>>>> instead?
>>>>>
>>>>> WRT runner dictating resource constraints: The runner actually may
>>>>> also not have that information. It would need to be supplied as part of 
>>>>> the
>>>>> pipeline options? The cluster resource manager needs to allocate resources
>>>>> for both, the runner and the SDK harness(es).
>>>>>
>>>>> Finally, what can be done to unblock the Flink runner / Python until
>>>>> solution discussed here is in place? An extra runner option for SDK
>>>>> singleton on/off?
>>>>>
>>>>>
>>>>> On Sat, Aug 18, 2018 at 1:34 AM Ankur Goenka 
>>>>> wrote:
>>>>>
>>>>>> Sounds good to me.
>>>>>> GRPC Header of the control channel seems to be a good place to add
>>>>>> upper bound information.
>>>>>> Added jiras:
>>>>>> https://issues.apache.org/jira/browse/BEAM-5166
>>>>>> https://issues.apache.org/jira/browse/BEAM-5167
>>>>>>
>>>>>> On Fri, Aug 17, 2018 at 10:51 PM H

Re: Process JobBundleFactory for portable runner

2018-08-21 Thread Henning Rohde
By "enum" in quotes, I meant the usual open URN-style pattern, not an actual
enum. Sorry if that wasn't clear.

On Tue, Aug 21, 2018 at 11:51 AM Lukasz Cwik  wrote:

> I would model the environment to be more free-form than enums such that we
> have forward-looking extensibility and would suggest to follow the same
> pattern we use on PTransforms (using an URN and a URN specific payload).
> Note that in this case we may want to support a list of supported
> environments (e.g. java, docker, python, ...).
>
> On Tue, Aug 21, 2018 at 10:37 AM Henning Rohde  wrote:
>
>> One thing to consider that we've talked about in the past. It might make
>> sense to extend the environment proto and have the SDK be explicit about
>> which kinds of environment it supports:
>>
>>
>> https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969
>>
>> This choice might impact what files are staged or what not. Some SDKs,
>> such as Go, provide a compiled binary and _need_ to know what the target
>> architecture is. Running on a mac with and without docker, say, requires a
>> different worker in each case. If we add an "enum", we can also easily add
>> the external idea where the SDK/user starts the SDK harnesses instead of
>> the runner. Each runner may not support all types of environments.
>>
>> Henning
>>
>> On Tue, Aug 21, 2018 at 2:52 AM Maximilian Michels 
>> wrote:
>>
>>> For reference, here is corresponding JIRA issue for this thread:
>>> https://issues.apache.org/jira/browse/BEAM-5187
>>>
>>> On 16.08.18 11:15, Maximilian Michels wrote:
>>> > Makes sense to have an option to run the SDK harness in a
>>> non-dockerized
>>> > environment.
>>> >
>>> > I'm in the process of creating a Docker entry point for Flink's
>>> > JobServer[1]. I suppose you would also prefer to execute that one
>>> > standalone. We should make sure this is also an option.
>>> >
>>> > [1] https://issues.apache.org/jira/browse/BEAM-4130
>>> >
>>> > On 16.08.18 07:42, Thomas Weise wrote:
>>> >> Yes, that's the proposal. Everything that would otherwise be packaged
>>> >> into the Docker container would need to be pre-installed in the host
>>> >> environment. In the case of Python SDK, that could simply mean a
>>> >> (frozen) virtual environment that was setup when the host was
>>> >> provisioned - the SDK harness process(es) will then just utilize that.
>>> >> Of course this flavor of SDK harness execution could also be useful in
>>> >> the local development environment, where right now someone who already
>>> >> has the Python environment needs to also install Docker and update a
>>> >> container to launch a Python SDK pipeline on the Flink runner.
>>> >>
>>> >> On Wed, Aug 15, 2018 at 12:40 PM Daniel Oliveira <
>>> danolive...@google.com
>>> >> <mailto:danolive...@google.com>> wrote:
>>> >>
>>> >>  I just want to clarify that I understand this correctly since I'm
>>> >>  not that familiar with the details behind all these execution
>>> >>  environments yet. Is the proposal to create a new
>>> JobBundleFactory
>>> >>  that instead of using Docker to create the environment that the
>>> new
>>> >>  processes will execute in, this JobBundleFactory would execute
>>> the
>>> >>  new processes directly in the host environment? So in practice
>>> if I
>>> >>  ran a pipeline with this JobBundleFactory the SDK Harness and
>>> Runner
>>> >>  Harness would both be executing directly on my machine and would
>>> >>  depend on me having the dependencies already present on my
>>> machine?
>>> >>
>>> >>  On Mon, Aug 13, 2018 at 5:50 PM Ankur Goenka >> >>  <mailto:goe...@google.com>> wrote:
>>> >>
>>> >>  Thanks for starting the discussion. I will be happy to help.
>>> >>  I agree, we should have a pluggable SDKHarness environment factory.
>>> >>  We can register multiple Environment factories using the service
>>> >>  registry and use the PipelineOption to pick the right one on a
>>> >>  per-job basis.
>>> >>

Re: Process JobBundleFactory for portable runner

2018-08-21 Thread Henning Rohde
One thing to consider that we've talked about in the past. It might make
sense to extend the environment proto and have the SDK be explicit about
which kinds of environment it supports:


https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969

This choice might impact what files are staged or what not. Some SDKs, such
as Go, provide a compiled binary and _need_ to know what the target
architecture is. Running on a mac with and without docker, say, requires a
different worker in each case. If we add an "enum", we can also easily add
the external idea where the SDK/user starts the SDK harnesses instead of
the runner. Each runner may not support all types of environments.

Henning

On Tue, Aug 21, 2018 at 2:52 AM Maximilian Michels  wrote:

> For reference, here is corresponding JIRA issue for this thread:
> https://issues.apache.org/jira/browse/BEAM-5187
>
> On 16.08.18 11:15, Maximilian Michels wrote:
> > Makes sense to have an option to run the SDK harness in a non-dockerized
> > environment.
> >
> > I'm in the process of creating a Docker entry point for Flink's
> > JobServer[1]. I suppose you would also prefer to execute that one
> > standalone. We should make sure this is also an option.
> >
> > [1] https://issues.apache.org/jira/browse/BEAM-4130
> >
> > On 16.08.18 07:42, Thomas Weise wrote:
> >> Yes, that's the proposal. Everything that would otherwise be packaged
> >> into the Docker container would need to be pre-installed in the host
> >> environment. In the case of Python SDK, that could simply mean a
> >> (frozen) virtual environment that was setup when the host was
> >> provisioned - the SDK harness process(es) will then just utilize that.
> >> Of course this flavor of SDK harness execution could also be useful in
> >> the local development environment, where right now someone who already
> >> has the Python environment needs to also install Docker and update a
> >> container to launch a Python SDK pipeline on the Flink runner.
> >>
> >> On Wed, Aug 15, 2018 at 12:40 PM Daniel Oliveira <
> danolive...@google.com
> >> > wrote:
> >>
> >>  I just want to clarify that I understand this correctly since I'm
> >>  not that familiar with the details behind all these execution
> >>  environments yet. Is the proposal to create a new JobBundleFactory
> >>  that instead of using Docker to create the environment that the new
> >>  processes will execute in, this JobBundleFactory would execute the
> >>  new processes directly in the host environment? So in practice if I
> >>  ran a pipeline with this JobBundleFactory the SDK Harness and
> Runner
> >>  Harness would both be executing directly on my machine and would
> >>  depend on me having the dependencies already present on my machine?
> >>
> >>  On Mon, Aug 13, 2018 at 5:50 PM Ankur Goenka  >>  > wrote:
> >>
> >>  Thanks for starting the discussion. I will be happy to help.
> >>  I agree, we should have a pluggable SDKHarness environment factory.
> >>  We can register multiple Environment factories using the service
> >>  registry and use the PipelineOption to pick the right one on a
> >>  per-job basis.
> >>
> >>  There are a couple of things which need to be set up before
> >>  launching the process.
> >>
> >>  * Setting up the environment as done in boot.go [4]
> >>  * Retrieving and putting the artifacts in the right location.
> >>
> >>  You can probably leverage boot.go code to set up the
> environment.
> >>
> >>  Also, it will be useful to enumerate pros and cons of different
> >>  Environments to help users choose the right one.
> >>
> >>
> >>  On Mon, Aug 6, 2018 at 4:50 PM Thomas Weise  >>  > wrote:
> >>
> >>  Hi,
> >>
> >>  Currently the portable Flink runner only works with SDK
> >>  Docker containers for execution (DockerJobBundleFactory,
> >>  besides an in-process (embedded) factory option for testing
> >>  [1]). I'm considering adding another out of process
> >>  JobBundleFactory implementation that directly forks the
> >>  processes on the task manager host, eliminating the need
> for
> >>  Docker. This would work reasonably well in environments
> >>  where the dependencies (in this case Python) can easily be
> >>  tied into the host deployment (also within an application
> >>  specific Kubernetes pod).
> >>
> >>  There was already some discussion about alternative
> >>  JobBundleFactory implementation in [2]. There is also a
> JIRA
> >>  to make the bundle factory pluggable [3], pending
> >>  availability of runner level options.
> >>
> >>  

Re: Bootstrapping Beam's Job Server

2018-08-20 Thread Henning Rohde
>> Option 3) would be to map in the docker binary and socket to allow
>> the containerized Flink job server to start "sibling" containers on
>> the host.
>
>Do you mean packaging Docker inside the Job Server container and
>mounting /var/run/docker.sock from the host inside the container? That
>looks like a bit of a hack but for testing it could be fine.

Basically, yes, although I would also map in the docker binary itself to
ensure compatibility with the host. It's not something I would suggest for
production use -- just for local jobs. It would allow the Go SDK to just
work OOB, for example.
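
A hedged sketch of that mapping (the image name and paths are illustrative;
this is the generic "sibling containers" pattern, not a Beam-provided
command):

import subprocess

subprocess.check_call([
    'docker', 'run',
    # Host Docker socket: containers started from inside become siblings
    # on the host rather than nested containers.
    '-v', '/var/run/docker.sock:/var/run/docker.sock',
    # Host docker client binary, to keep client/daemon versions compatible.
    '-v', '/usr/bin/docker:/usr/bin/docker',
    'beam-flink-job-server:latest',  # hypothetical job server image
])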

The process-based scenario can be a configuration feature of each
SDK/runner. I see that as a useful complement to dockerized SDKs, although
the onus is then on the runner/user to ensure the environment is adequate
for the SDK(s) used in the job. The main appeal of docker is precisely to
not have that requirement, but for some deployments it is reasonable.

Henning


On Mon, Aug 20, 2018 at 3:00 PM Maximilian Michels  wrote:

> Thanks for your suggestions. Please see below.
>
> > Option 3) would be to map in the docker binary and socket to allow
> > the containerized Flink job server to start "sibling" containers on
> > the host.
>
> Do you mean packaging Docker inside the Job Server container and
> mounting /var/run/docker.sock from the host inside the container? That
> looks like a bit of a hack but for testing it could be fine.
>
> > notably, if the runner supports auto-scaling or similar non-trivial
> > configurations, that would be difficult to manage from the SDK side.
>
> You're right, it would be unfortunate if the SDK would have to deal with
> spinning up SDK harness/backend containers. For non-trivial
> configurations it would probably require an extended protocol.
>
> > Option 4) We are also thinking about adding process based SDKHarness.
> > This will avoid docker in docker scenario.
>
> Actually, I had started implementing a process-based SDK harness but
> figured it might be impractical because it doubles the execution path
> for UDF code and potentially doesn't work with custom dependencies.
>
> > Process based SDKHarness also has other applications and might be
> > desirable in some of the production use cases.
>
> True. Some users might want something more lightweight.
>


Re: Bootstrapping Beam's Job Server

2018-08-20 Thread Henning Rohde
Option 3) would be to map in the docker binary and socket to allow the
containerized Flink job server to start "sibling" containers on the host.
That both avoids docker-in-docker (which is indeed undesirable) as well as
extra requirements for each SDK to spin up containers -- notably, if the
runner supports auto-scaling or similar non-trivial configurations, that
would be difficult to manage from the SDK side.

Henning

On Mon, Aug 20, 2018 at 8:31 AM Maximilian Michels  wrote:

> Hi everyone,
>
> I wanted to get your opinion on the Job-Server startup [1] which is part
> of the portability story.
>
> I've created a docker container to bring up Beam's Job Server, which is
> the entry point for pipeline execution. Generally, this works fine when
> the backend (Flink in this case) runs externally and the Job Server
> connects to it.
>
> For tests or pipeline development we may want the backend to run
> embedded (inside the Job Server) which is rather problematic because the
> portability requires to spin up the SDK harness in a Docker container as
> well. This would happen at runtime inside the Docker container.
>
> Since Docker inside Docker is not desirable I'm thinking about other
> options:
>
> Option 1) Instead of a Docker container, we start a bundled Job-Server
> binary (or jar) when we run the pipeline. The bundle also contains an
> embedded variant of the backend. For Flink, this is basically the output
> of `:beam-runners-flink_2.11-job-server:shadowJar` but it is started
> during pipeline execution.
>
> Option 2) In addition to the Job Server, we let the SDK spin up another
> Docker container with the backend. This is may be most applicable to all
> types of backends since not all backends offer an embedded execution mode.
>
>
> Keep in mind that this is only a problem for local/test execution but it
> is an important aspect of Beam's usability.
>
> What do you think? I'm leaning towards option 2. Maybe you have other
> options in mind.
>
> Cheers,
> Max
>
> [1] https://issues.apache.org/jira/browse/BEAM-4130
>


Re: Discussion: Scheduling across runner and SDKHarness in Portability framework

2018-08-17 Thread Henning Rohde
Regarding resources: the runner can currently dictate the mem/cpu/disk
resources that the harness is allowed to use via the provisioning api. The
SDK harness need not -- and should not -- speculate on what else might be
running on the machine:

https://github.com/apache/beam/blob/0e14965707b5d48a3de7fa69f09d88ef0aa48c09/model/fn-execution/src/main/proto/beam_provision_api.proto#L69

A realistic startup-time computation in the SDK harness would be something
simple like max(1, min(cpu*100, mem_mb/10)), say, used as the maximum
number of threads. Or just hardcode it to 300. Or use a user-provided value.
Whatever the value, it is the maximum number of bundles in flight allowed at
any given time and needs to be communicated to the runner via some message.
Anything beyond would be rejected (but this shouldn't happen, because the
runner should respect that number).

A dynamic computation would use the same limits from the SDK, but take into
account its own resource usage (incl. the usage by running bundles).
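
To make the shape of that concrete, a rough Go sketch of the startup-time variant (the struct is a stand-in for whatever the provisioning API returns; field names are illustrative):

package main

import "fmt"

// provisionInfo is a stand-in for the resource limits the runner reports
// via the provisioning API.
type provisionInfo struct {
    CPUShares float64 // CPU made available to the harness
    MemoryMB  uint64  // memory made available to the harness, in MB
}

// maxBundles returns the maximum number of bundles the harness will accept
// in flight at any given time; the runner must not exceed it.
func maxBundles(p provisionInfo) int {
    n := int(p.CPUShares * 100)
    if m := int(p.MemoryMB / 10); m < n {
        n = m
    }
    if n < 1 {
        n = 1 // never advertise less than one bundle
    }
    return n
}

func main() {
    fmt.Println(maxBundles(provisionInfo{CPUShares: 2, MemoryMB: 4096})) // prints 200
}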

On Fri, Aug 17, 2018 at 6:20 PM Ankur Goenka  wrote:

> I am thinking of the upper bound more along the lines of a theoretical upper
> limit or any other static high value beyond which the SDK will reject
> bundles verbosely. The idea is that the SDK will not keep bundles in a queue while
> waiting on current bundles to finish. It will simply reject any additional
> bundle.
> Beyond this I don't have a good answer to a dynamic upper bound. As the SDK does
> not have the complete picture of the processes on the machine with which it
> shares resources, resources might not be a good proxy for an upper bound from
> the SDK point of view.
>
> On Fri, Aug 17, 2018 at 6:01 PM Lukasz Cwik  wrote:
>
>> Ankur, how would you expect an SDK to compute a realistic upper bound
>> (upfront or during pipeline computation)?
>>
>> First thought that came to my mind was that the SDK would provide
>> CPU/memory/... resourcing information and the runner making a judgement
>> call as to whether it should ask the SDK to do more work or less but its
>> not an explicit don't do more then X bundles in parallel.
>>
>> On Fri, Aug 17, 2018 at 5:55 PM Ankur Goenka  wrote:
>>
>>> Makes sense. Exposing an upper bound on concurrency along with the optimum
>>> concurrency can give a good balance. This is good information to expose
>>> while keeping the requirements from the SDK simple. SDK can publish 1 as
>>> the optimum concurrency and upper bound to keep things simple.
>>>
>>> Runner introspection of upper bound on concurrency is important for
>>> correctness while introspection of optimum concurrency is important for
>>> efficiency. This separates efficiency and correctness requirements.
>>>
>>> On Fri, Aug 17, 2018 at 5:05 PM Henning Rohde 
>>> wrote:
>>>
>>>> I agree with Luke's observation, with the caveat that "infinite amount
>>>> of bundles in parallel" is limited by the available resources. For example,
>>>> the Go SDK harness will accept an arbitrary amount of parallel work, but
>>>> too much work will cause either excessive GC pressure with crippling
>>>> slowness or an outright OOM. Unless it's always 1, a reasonable upper bound
>>>> will either have to be provided by the user or computed from the mem/cpu
>>>> resources given. Of course, some bundles take more resources than
>>>> others, so any static value will be an estimate or ignore resource limits.
>>>>
>>>> That said, I do not like that an "efficiency" aspect becomes a subtle
>>>> requirement for correctness due to Flink internals. I fear that road leads
>>>> to trouble.
>>>>
>>>> On Fri, Aug 17, 2018 at 4:26 PM Ankur Goenka  wrote:
>>>>
>>>>> The latter case of supporting single bundle execution at a
>>>>> time on the SDK, with the runner not using this flag, is exactly the reason we got
>>>>> into the deadlock here.
>>>>> I agree with exposing the SDK optimum concurrency level (1 in the latter
>>>>> case) and letting the runner decide whether to use it. But at the same time we expect the SDK
>>>>> to handle an infinite amount of bundles even if it's not efficient.
>>>>>
>>>>> Thanks,
>>>>> Ankur
>>>>>
>>>>> On Fri, Aug 17, 2018 at 4:11 PM Lukasz Cwik  wrote:
>>>>>
>>>>>> I believe in practice SDK harnesses will fall into one of two
>>>>>> capabilities, can process effectively an infinite amount of bundles in
>>>>>> parallel or can only process a single bundle at a time.
>>>>>>

Re: Discussion: Scheduling across runner and SDKHarness in Portability framework

2018-08-17 Thread Henning Rohde
I agree with Luke's observation, with the caveat that "infinite amount of
bundles in parallel" is limited by the available resources. For example,
the Go SDK harness will accept an arbitrary amount of parallel work, but
too much work will cause either excessive GC pressure with crippling
slowness or an outright OOM. Unless it's always 1, a reasonable upper bound
will either have to be provided by the user or computed from the mem/cpu
resources given. Of course, some bundles take more resources than
others, so any static value will be an estimate or ignore resource limits.

That said, I do not like that an "efficiency" aspect becomes a subtle
requirement for correctness due to Flink internals. I fear that road leads
to trouble.

On Fri, Aug 17, 2018 at 4:26 PM Ankur Goenka  wrote:

> The latter case of supporting single bundle execution at a time
> on the SDK, with the runner not using this flag, is exactly the reason we got into the
> deadlock here.
> I agree with exposing the SDK optimum concurrency level (1 in the latter case)
> and letting the runner decide whether to use it. But at the same time we expect the SDK to
> handle an infinite amount of bundles even if it's not efficient.
>
> Thanks,
> Ankur
>
> On Fri, Aug 17, 2018 at 4:11 PM Lukasz Cwik  wrote:
>
>> I believe in practice SDK harnesses will fall into one of two
>> capabilities, can process effectively an infinite amount of bundles in
>> parallel or can only process a single bundle at a time.
>>
>> I believe it is more difficult for a runner to handle the latter case
>> well and to perform all the environment management that would make that
>> efficient. It may be inefficient for an SDK but I do believe it should be
>> able to say that I'm not great at anything more than a single bundle at a
>> time but utilizing this information by a runner should be optional.
>>
>>
>>
>> On Fri, Aug 17, 2018 at 1:53 PM Ankur Goenka  wrote:
>>
>>> To recap the discussion it seems that we have come-up with following
>>> point.
>>> SDKHarness Management and initialization.
>>>
>>>1. Runner completely owns the work assignment to SDKHarness.
>>>2. Runner should know the capabilities and capacity of SDKHarness
>>>and should assign work accordingly.
>>>3. Spinning up of SDKHarness is runner's responsibility and it can
>>>be done statically (a fixed pre configured number of SDKHarness) or
>>>dynamically or based on certain other configuration/logic which runner
>>>choose.
>>>
>>> SDKHarness Expectation. This is more in question and we should outline
>>> the responsibility of SDKHarness.
>>>
>>>1. SDKHarness should publish how many concurrent tasks it can
>>>execute.
>>>2. SDKHarness should start executing all the task items assigned to it in
>>>parallel in a timely manner, or fail the task.
>>>
>>> Also, to add to the simplification side: I think for better adoption, we
>>> should have a simple SDKHarness as well as simple runner integration to
>>> encourage integration with more runners. Also, many runners might not expose
>>> some of their internal scheduling characteristics, so we should not require
>>> knowledge of scheduling characteristics for runner integration. Moreover, scheduling
>>> characteristics can change based on pipeline type, infrastructure,
>>> available resources etc. So I am a bit hesitant to add runner scheduling
>>> specifics for runner integration.
>>> A good balance between SDKHarness complexity and runner integration can
>>> make adoption easier.
>>>
>>> Thanks,
>>> Ankur
>>>
>>> On Fri, Aug 17, 2018 at 12:22 PM Henning Rohde 
>>> wrote:
>>>
>>>> Finding a good balance is indeed the art of portability, because the
>>>> range of capability (and assumptions) on both sides is wide.
>>>>
>>>> It was originally the idea to allow the SDK harness to be an extremely
>>>> simple bundle executer (specifically, single-threaded execution one
>>>> instruction at a time) however inefficient -- a more sophisticated SDK
>>>> harness would support more features and be more efficient. For the issue
>>>> described here, it seems problematic to me to send non-executable bundles
>>>> to the SDK harness under the expectation that the SDK harness will
>>>> concurrently work its way deeply enough down the instruction queue to
>>>> unblock itself. That would be an extremely subtle requirement for SDK
>>>> authors and one practical question becomes: what should an SD

Re: Discussion: Scheduling across runner and SDKHarness in Portability framework

2018-08-17 Thread Henning Rohde
Finding a good balance is indeed the art of portability, because the range
of capability (and assumptions) on both sides is wide.

It was originally the idea to allow the SDK harness to be an extremely
simple bundle executer (specifically, single-threaded execution one
instruction at a time) however inefficient -- a more sophisticated SDK
harness would support more features and be more efficient. For the issue
described here, it seems problematic to me to send non-executable bundles
to the SDK harness under the expectation that the SDK harness will
concurrently work its way deeply enough down the instruction queue to
unblock itself. That would be an extremely subtle requirement for SDK
authors and one practical question becomes: what should an SDK do with a
bundle instruction that it doesn't have capacity to execute? If a runner
needs to make such assumptions, I think that information should probably
rather be explicit along the lines of proposal 1 -- i.e., some kind of
negotiation between resources allotted to the SDK harness (a preliminary
variant is in the provisioning api) and what the SDK harness in return can
do (and a valid answer might be: 1 bundle at a time irrespective of
resources given) or a per-bundle special "overloaded" error response. For
other aspects, such as side input readiness, the runner handles that
complexity and the overall bias has generally been to move complexity to
the runner side.
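
For illustration, a rough Go sketch of such an explicit "overloaded" response -- the harness caps in-flight bundles and rejects, rather than queues, anything beyond its declared capacity (types and names are made up):

package main

import "errors"

var errOverloaded = errors.New("harness at capacity; bundle rejected")

type harness struct {
    slots chan struct{} // capacity == advertised max concurrent bundles
}

func newHarness(maxBundles int) *harness {
    return &harness{slots: make(chan struct{}, maxBundles)}
}

// processBundle fails fast instead of queueing, so a runner that sends more
// work than advertised gets an explicit error instead of a subtle deadlock.
func (h *harness) processBundle(run func()) error {
    select {
    case h.slots <- struct{}{}: // try to acquire a slot without blocking
    default:
        return errOverloaded
    }
    go func() {
        defer func() { <-h.slots }() // release the slot when done
        run()
    }()
    return nil
}

func main() {
    h := newHarness(1)
    _ = h.processBundle(func() { select {} }) // accepted; occupies the only slot
    if err := h.processBundle(func() {}); err != nil {
        println(err.Error()) // rejected: harness at capacity
    }
}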

The SDK harness and initialization overhead is entirely SDK, job type and
even pipeline specific. A docker container is also just a process, btw, and
doesn't inherently carry much overhead. That said, on a single host, a
static docker configuration is generally a lot simpler to work with.

Henning


On Fri, Aug 17, 2018 at 10:18 AM Thomas Weise  wrote:

> It is good to see this discussed!
>
> I think there needs to be a good balance between the SDK harness
> capabilities/complexity and responsibilities. Additionally the user will
> need to be able to adjust the runner behavior, since the type of workload
> executed in the harness also is a factor. Elsewhere we already discussed
> that the current assumption of a single SDK harness instance per Flink task
> manager brings problems with it and that there needs to be more than one
> way how the runner can spin up SDK harnesses.
>
> There was the concern that instantiation of multiple SDK harnesses per TM
> host is expensive (resource usage, initialization time etc.). That may hold
> true for a specific scenario, such as batch workloads and the use of Docker
> containers. But it may look totally different for a streaming topology or
> when SDK harness is just a process on the same host.
>
> Thanks,
> Thomas
>
>
> On Fri, Aug 17, 2018 at 8:36 AM Lukasz Cwik  wrote:
>
>> SDK harnesses were always responsible for executing all work given to them
>> concurrently. Runners have been responsible for choosing how much work to
>> give to the SDK harness in such a way that best utilizes the SDK harness.
>>
>> I understand that multithreading in Python is inefficient due to the
>> global interpreter lock, so it would be up to the runner in this case to make
>> sure that the amount of work it gives to each SDK harness best utilizes it
>> while spinning up an appropriate number of SDK harnesses.
>>
>> On Fri, Aug 17, 2018 at 7:32 AM Maximilian Michels 
>> wrote:
>>
>>> Hi Ankur,
>>>
>>> Thanks for looking into this problem. The cause seems to be Flink's
>>> pipelined execution mode. It runs multiple tasks in one task slot and
>>> produces a deadlock when the pipelined operators schedule the SDK
>>> harness DoFns in non-topological order.
>>>
>>> The problem would be resolved if we scheduled the tasks in topological
>>> order. Doing that is not easy because they run in separate Flink
>>> operators and the SDK Harness would have to have insight into the
>>> execution graph (which is not desirable).
>>>
>>> The easiest method, which you proposed in 1) is to ensure that the
>>> number of threads in the SDK harness matches the number of
>>> ExecutableStage DoFn operators.
>>>
>>> The approach in 2) is what Flink does as well. It glues together
>>> horizontal parts of the execution graph, also in multiple threads. So I
>>> agree with your proposed solution.
>>>
>>> Best,
>>> Max
>>>
>>> On 17.08.18 03:10, Ankur Goenka wrote:
>>> > Hi,
>>> >
>>> > tl;dr: Deadlock in task execution caused by limited task parallelism on
>>> > SDKHarness.
>>> >
>>> > *Setup:*
>>> >
>>> >   * Job type: *Beam Portable Python Batch* job on Flink standalone
>>> > cluster.
>>> >   * Only a single job is scheduled on the cluster.
>>> >   * Everything is running on a single machine with single Flink task
>>> > manager.
>>> >   * Flink Task Manager Slots is 1.
>>> >   * Flink Parallelism is 1.
>>> >   * Python SDKHarness has 1 thread.
>>> >
>>> > *Example pipeline:*
>>> > Read -> MapA -> GroupBy -> MapB -> WriteToSink
>>> >
>>> > *Issue:*
>>> > With multi stage job, Flink schedule different 

Re: Beam Docs Contributor

2018-07-30 Thread Henning Rohde
Welcome Rose! Great to have you here.

On Mon, Jul 30, 2018 at 2:23 AM Ismaël Mejía  wrote:

> Welcome !
> Great to see someone new working in this important area for the project.
>
>
> On Mon, Jul 30, 2018 at 5:57 AM Kai Jiang  wrote:
>
>> Welcome Rose!
>>
>> On Sun, Jul 29, 2018 at 8:53 PM Rui Wang  wrote:
>>
>>> Welcome!
>>>
>>> -Rui
>>>
>>> On Sun, Jul 29, 2018 at 7:07 PM Griselda Cuevas  wrote:
>>>
 Welcome Rose, very glad to have you in the community :)



 On Fri, 27 Jul 2018 at 16:29, Ahmet Altay  wrote:

> Welcome Rose! Looking forward to your contributions.
>
> On Fri, Jul 27, 2018 at 4:08 PM, Rose Nguyen 
> wrote:
>
>> Hi all:
>>
>> I'm Rose! I've worked on Cloud Dataflow documentation and now I'm
>> starting a project to refresh the Beam docs and improve the onboarding
>> experience. We're planning on splitting up the programming guide into
>> multiple pages, making the docs more accessible for new users. I've got
>> lots of ideas for doc improvements, some of which are motivated by the UX
>> research, and am excited to share them with you all and work on them.
>>
>> I look forward to interacting with everybody in the community. I
>> welcome comments, thoughts, feedback, etc.
>> --
>>
>>
>> Rose Thi Nguyen
>>
>>   Technical Writer
>>
>> (281) 683-6900
>>
>
>


Re: No JVM - new Runner?

2018-07-17 Thread Henning Rohde
There are essentially 2 complementary portability API surfaces that you'd
need to implement: job management incl. job submission and execution as
well as some worker deployment plumbing specific to the runner. Note that
the source of truth is the model protos -- the design docs linked from
https://beam.apache.org/contribute/portability/ and (even more so) the
website guides are not always up to date.

Currently, all runners are in Java and share numerous components and
utilities. A non-JVM runner would have to build all that from scratch --
although, as you mention, if you're using Go or Python the corresponding
SDKs likely have many pieces that can be reused. A minor potential hiccup
is that gRPC/protobuf is not natively supported everywhere, so you may end
up interoperating with the C versions of the libraries if you pick a
non-supported language. A separate challenge regardless of the language is
how directly the Beam model and primitives map to the engine.
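
To sketch the shape of those two surfaces, here is roughly what a non-JVM runner would have to stand up, written as Go interfaces. The method set only loosely mirrors the services in the model protos (beam_job_api.proto and friends), which remain the source of truth; all types here are placeholders:

package runner

// Placeholder types, for illustration only.
type (
    Pipeline    struct{} // would be the pipeline proto
    JobState    int
    Environment struct{ ContainerImage string }
    Endpoints   struct{ Control, Logging, Provision, Artifact string }
)

// JobService loosely mirrors Prepare/Run/GetState/Cancel from beam_job_api.proto.
type JobService interface {
    Prepare(p Pipeline, options map[string]interface{}, jobName string) (preparationID string, err error)
    Run(preparationID string) (jobID string, err error)
    GetState(jobID string) (JobState, error)
    Cancel(jobID string) error
}

// WorkerDeployment is the runner-specific plumbing that brings up SDK
// harnesses (e.g. docker containers) and points them at the right endpoints.
type WorkerDeployment interface {
    StartWorker(id string, env Environment, ep Endpoints) error
    StopWorker(id string) error
}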

All that said, I think it's definitely feasible to do something
interesting. Are you specifically thinking of a Go Wallaroo runner?

Thanks,
 Henning

On Tue, Jul 17, 2018 at 9:26 AM Austin Bennett 
wrote:

> Sweet; that led me to:
> https://beam.apache.org/contribute/runner-guide/#the-runner-api (which I
> can't believe I missed).
>
>
>
> On Tue, Jul 17, 2018 at 9:21 AM, Jean-Baptiste Onofré 
> wrote:
>
>> Hi Austin,
>>
>> If your runner provides the gRPC portability layer (allowing any SDK to
>> "interact" with the runner), it will work no matter how the runner is
>> implemented (JVM or not).
>>
>> However, it means that you will have to mimic the Runner API for the
>> translation.
>>
>> Regards
>> JB
>>
>> On 17/07/2018 18:19, Austin Bennett wrote:
>> > Hi Beam Devs,
>> >
>> > I still don't quite understand:
>> >
>> > "Apache Beam provides a portable API layer for building sophisticated
>> > data-parallel processing pipelines that may be executed across a
>> > diversity of execution engines, or /runners/."
>> >
>> > (from https://beam.apache.org/documentation/runners/capability-matrix/)
>> >
>> > And specifically, close reading
>> > of: https://beam.apache.org/contribute/portability/
>> >
>> > What if I'd like to implement a runner that is non-JVM, though it would
>> > leverage the Python and Go SDKs?  Specifically, thinking of:
>> >  https://www.wallaroolabs.com (I am out in NY meeting with friends
>> there
>> > later this week, and wanted to get a sense of, feasibility, work
>> > involved, etc -- to propose that we add a new Wallaroo runner).
>> >
>> > Is there a way to keep java out of the mix completely and still work
>> > with Beam on a non JVM runner (seems maybe eventually, but what about
>> > currently/near future)?
>> >
>> > Any input, thoughts, ideas, other pages or info to explore -- all
>> > appreciated; thanks!
>> > Austin
>> >
>> >
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>
>


Re: CODEOWNERS for apache/beam repo

2018-07-10 Thread Henning Rohde
+1. Sounds like a useful improvement.

Udi -- do the reviewers in this file need to be committers for the PR
auto-assignment to work?

On Tue, Jul 10, 2018 at 1:59 AM Łukasz Gajowy 
wrote:

> +1. It will certainly be useful. I added myself (and a fellow contributor)
> to some components (IO testing related mostly).
>
> Thanks,
> Łukasz
>
> wt., 10 lip 2018 o 02:06 Udi Meiri  napisał(a):
>
>> Hi everyone,
>>
>> I'm proposing to add auto-reviewer-assignment using Github's CODEOWNERS
>> mechanism.
>> Initial version is here: https://github.com/apache/beam/pull/5909/files
>>
>> I need help from the community in determining owners for each component.
>> Feel free to directly edit the PR (if you have permission) or add a
>> comment.
>>
>>
>> Background
>> The idea is to:
>> 1. Document good review candidates for each component.
>> 2. Help choose reviewers using the auto-assignment mechanism. The
>> suggestion is in no way binding.
>>
>>
>>


Re: Samza runner committer support

2018-06-29 Thread Henning Rohde
I would be happy to help with some of the portability-related changes, as
well as trying out Go-on-Samza. I'm out on vacation most of July, though.

On Thu, Jun 28, 2018 at 1:12 PM Ahmet Altay  wrote:

> I would be happy to help with this. I can prioritize python related
> changes. For portability related changes I can try to help but I may not be
> the best person.
>
> On Thu, Jun 28, 2018 at 12:33 PM, Xinyu Liu  wrote:
>
>> Hi, All,
>>
>> Our Samza runner has recently been merged to master, and Kenn has been
>> extremely instrumental during the whole process, e.g. design decisions,
>> feature requests and code reviews. We would like to thank him for all the
>> support he has given us!
>>
>> Given Kenn is going to be on leave soon, we want to call out for
>> committer sponsors who will work with us in the meantime. We expect a lot
>> more upcoming updates for the Samza runner, such as the portable runner and
>> python support. We will also have some feature asks, e.g. the async API
>> briefly discussed in the previous email. Would anyone be interested in
>> being a sponsor part time for the Samza runner?
>>
>> Thanks,
>> Xinyu
>>
>
>


Re: SDK Harness Deployment

2018-06-08 Thread Henning Rohde
> a Flink task manager lifecycle isn't tied to a single job (it is a worker
that exists even before the pipeline was submitted). Therefore we will need
to fall back to (1). Boot code would fail to connect until the job was
deployed [...].

The pipeline defines which environment (= SDK harness) to use, so we
wouldn't in general be able to run it before the job is submitted. If you have
some more specialized scenario in mind -- which I assume here is the case
-- where the setup supports only a fixed set of SDK harnesses (or a custom
one you author) then you may also be able to use the same SDK harness to
run multiple jobs. How practical that is depends on the SDK, whether it
uses job-specific information, what kind of isolation between jobs is
needed and whether the artifacts are identical.

For Go, for example, the default container image is identical across jobs
and the sole artifact is the user binary, so to be usable across jobs all
DoFns for all jobs must be compiled into a single binary. Nothing currently
uses the job-specific information. If the SDK harness is not given
persistent storage through the 'semi_persistent_path' flag, it is stateless
and could serve multiple jobs in sequence if restarted -- assuming
provisioning/artifacts each time updated to serve information for the new
job. An "exit" instruction might be helpful to force the SDK harness to
exit gracefully. If the setup is not accepting arbitrary container images
(or not using containers at all), you could also use a special container
image to restart just the inner Go process and re-pull
provisioning/artifacts instead of the whole container if that works better.
Either way, you could have a setup where a fixed set of such Go containers
match the same number of slots of the TM and can handle multiple jobs. If
the setup needs to support arbitrary SDKs and container images, however,
then you'd be back in the world of dynamically starting them somehow.
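
As a rough sketch of the "restart just the inner Go process" idea, a tiny supervisor could keep the container alive across jobs (the boot binary path and flags are illustrative, not part of the actual container contract):

package main

import (
    "log"
    "os"
    "os/exec"
    "time"
)

func main() {
    for {
        // Re-running the boot sequence re-pulls provisioning info and
        // artifacts, so each iteration can serve a different job.
        cmd := exec.Command("/opt/apache/beam/boot", os.Args[1:]...)
        cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
        if err := cmd.Run(); err != nil {
            log.Printf("worker exited: %v; restarting", err)
        }
        time.Sleep(5 * time.Second) // crude backoff before the next job
    }
}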

Thanks,
 Henning

On Fri, Jun 8, 2018 at 11:45 AM Thomas Weise  wrote:

> Sounds good.
>
> Regarding (2), the bootstrapping endpoints serve job specific info, but a
> Flink task manager lifecycle isn't tied to a single job (it is a worker
> that exists even before the pipeline was submitted). Therefore we will need
> to fall back to (1). Boot code would fail to connect until the job was
> deployed and the Flink runner has established the endpoint. This should
> work fine, as long as the boot code is retried without causing the entire
> container to exit, it may just be some noise in the logs?
>
> In my scenario there won't be a second job that runs on the same task
> manager, since we are planning to deploy Flink along with the application.
> But Flink in general also supports a "session" mode where multiple jobs can
> share the same set of task managers. In that case it would be necessary to
> isolate the SDK workers because they can only serve a single job (unless
> what you have listed under static information is identical).
>
> Looking at the current runner code there will be some work in the
> JobResourceManager/SingletonSdkHarnessManager neighborhood that I can pick
> up once we have the basics working in master. Currently SDK workers can
> only be distinguished by the port they connect to, the runner does not look
> at the worker ID or makes it available in any way. So the support to
> multiplex has to be added. Perhaps Ben/Alex can comment on this?
>
> Thanks,
> Thomas
>
>
> On Fri, Jun 8, 2018 at 10:19 AM, Henning Rohde  wrote:
>
>> You're right. That is the idea.
>>
>> Two comments on the executable stage not being available yet:
>>   (1) An SDK harness may either retry or fail (exit) if it can't
>> connect/times out/gets an error. If it exits, the runner/environment is
>> responsible for restarting the process/container. So it will effectively
>> always retry. The boot code currently used tries to connect for 2 mins
>> after which it gives up (and in turn is restarted and tries again). The
>> 2min is set a bit arbitrarily, btw, so we can adjust it for the default
>> containers.
>>   (2) The 2 bootstrapping endpoints serve static information (pipeline
>> options, artifacts, and job metadata) that may not require an executable
>> stage -- for example, if the artifact service just serves data from HDFS.
>> The control endpoint is mainly driven by the runner side, so the
>> multiplexer could allow any SDK harness to connect, but it just wouldn't
>> send any actual instructions until the executable stages were ready. So for
>> 2nd jobs or if we have some global hooks into the TM (or deploy a separate
>> process -- provisioning and artifacts are separate services to make this
>> possible), it might be possible to allow the SDK harness to boot in
>> parallel

Re: SDK Harness Deployment

2018-06-08 Thread Henning Rohde
You're right. That is the idea.

Two comments on the executable stage not being available yet:
  (1) An SDK harness may either retry or fail (exit) if it can't
connect/times out/gets an error. If it exits, the runner/environment is
responsible for restarting the process/container. So it will effectively
always retry. The boot code currently used tries to connect for 2 mins
after which it gives up (and in turn is restarted and tries again). The
2min is set a bit arbitrarily, btw, so we can adjust it for the default
containers.
  (2) The 2 bootstrapping endpoints serve static information (pipeline
options, artifacts, and job metadata) that may not require an executable
stage -- for example, if the artifact service just serves data from HDFS.
The control endpoint is mainly driven by the runner side, so the
multiplexer could allow any SDK harness to connect, but it just wouldn't
send any actual instructions until the executable stages were ready. So for
2nd jobs or if we have some global hooks into the TM (or deploy a separate
process -- provisioning and artifacts are separate services to make this
possible), it might be possible to allow the SDK harness to boot in
parallel with the TM being fully ready. Disclaimer: I have a limited
understanding of the Flink constraints here.

Thanks,
 Henning



On Fri, Jun 8, 2018 at 7:49 AM Thomas Weise  wrote:

> Yes, it did not occur to me that we have the identifier available for
> this. I just took a fresh look at
> https://s.apache.org/beam-fn-api-container-contract
>
> So it should be possible to start a pool of containers with pre-assigned
> IDs in the pod, communicate the same set of IDs to the runner (via its
> configuration) and then come up with some mechanism to assign executable
> stages to worker IDs as part of the Flink operator initialization.
>
> By the time the SDK boot code calls the provisioning service to fetch the
> pipeline options, the runner wouldn't be ready (either since the TM isn't
> running or the executable stages were not deployed into it yet). So will
> that call just retry until the endpoint becomes available? On the runner
> side, the endpoint can only be activated (under the fixed address) when the
> task slots are assigned.
>
> Thanks,
> Thomas
>
>
>
>
>
> On Wed, Jun 6, 2018 at 3:19 PM, Henning Rohde  wrote:
>
>> Thanks Thomas. The id provided to the SDK harness must be sent as a gRPC
>> header when it connects to the TM. The TM can use a fixed port and
>> multiplex requests based on that id - to match the SDK harness with the
>> appropriate job/slot/whatnot. The relationship between SDK harness and TM
>> is not limited to 1:1, but rather many:1. We'll likely need that for
>> cross-language as well. Wouldn't multiplexing on a single port for the
>> control plane be the easiest solution for both #1 and #2? The data plane
>> can still use various dynamically-allocated ports.
>>
>> On Kubernetes, we're somewhat constrained by the pod lifetime and
>> multi-job TMs might not be as natural to achieve.
>>
>> Thanks,
>>  Henning
>>
>> On Wed, Jun 6, 2018 at 2:28 PM Thomas Weise  wrote:
>>
>>> Hi Henning,
>>>
>>> Here is a page that explains the scheduling and overall functioning of
>>> the task manager in Flink:
>>>
>>>
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.5/internals/job_scheduling.html
>>>
>>> Here are the 2 issues:
>>>
>>> #1 each task manager process gets assigned multiple units of execution
>>> into task slots. So when we deploy a Beam pipeline, we can end up with
>>> multiple executable stages running in a single TM JVM.
>>>
>>> This is where a 1-to-1 relationship between TM and SDK harness can lead to
>>> a bottleneck (all task slots of a single TM push their work to a single SDK
>>> container).
>>>
>>> #2 in a deployment where multiple pipelines share a Flink cluster, the
>>> SDK harness per TM approach wouldn't work logically. We would need to have
>>> multiple SDK containers, not just for efficiency reasons.
>>>
>>> This would not be an issue for the deployment scenario I'm looking at,
>>> but it needs to be considered for general Flink runner deployment.
>>>
>>> Regarding the assignment of fixed endpoints within the TM, that is
>>> possible but it doesn't address #1 and #2.
>>>
>>> I hope this clarifies?
>>>
>>> Thanks,
>>> Thomas
>>>
>>>
>>> On Wed, Jun 6, 2018 at 12:31 PM, Henning Rohde 
>>> wrote:
>>>
>>>> Thanks for writing down and explaining the problem, Thomas. Let me try
>>>&

Re: SDK Harness Deployment

2018-06-06 Thread Henning Rohde
Thanks Thomas. The id provided to the SDK harness must be sent as a gRPC
header when it connects to the TM. The TM can use a fixed port and
multiplex requests based on that id - to match the SDK harness with the
appropriate job/slot/whatnot. The relationship between SDK harness and TM
is not limited to 1:1, but rather many:1. We'll likely need that for
cross-language as well. Wouldn't multiplexing on a single port for the
control plane be the easiest solution for both #1 and #2? The data plane
can still use various dynamically-allocated ports.
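
A rough Go sketch of the TM-side multiplexing (the "worker_id" metadata key matches the container contract; the routing map and instruction type are made up):

package tm

import (
    "context"
    "errors"
    "sync"

    "google.golang.org/grpc/metadata"
)

type instruction struct{} // stand-in for a control-plane instruction

// multiplexer routes control connections arriving on a single fixed port to
// the right job/slot based on the SDK harness id.
type multiplexer struct {
    mu    sync.Mutex
    slots map[string]chan instruction // harness id -> pending instructions
}

func workerID(ctx context.Context) (string, error) {
    md, ok := metadata.FromIncomingContext(ctx)
    if !ok || len(md.Get("worker_id")) == 0 {
        return "", errors.New("missing worker_id header")
    }
    return md.Get("worker_id")[0], nil
}

func (m *multiplexer) instructionsFor(ctx context.Context) (chan instruction, error) {
    id, err := workerID(ctx)
    if err != nil {
        return nil, err
    }
    m.mu.Lock()
    defer m.mu.Unlock()
    ch, ok := m.slots[id]
    if !ok {
        return nil, errors.New("unknown harness: " + id)
    }
    return ch, nil
}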

On Kubernetes, we're somewhat constrained by the pod lifetime and multi-job
TMs might not be as natural to achieve.

Thanks,
 Henning

On Wed, Jun 6, 2018 at 2:28 PM Thomas Weise  wrote:

> Hi Henning,
>
> Here is a page that explains the scheduling and overall functioning of the
> task manager in Flink:
>
>
> https://ci.apache.org/projects/flink/flink-docs-release-1.5/internals/job_scheduling.html
>
> Here are the 2 issues:
>
> #1 each task manager process get assigned multiple units of execution into
> task slots. So when we deploy a Beam pipeline, we can end up with multiple
> executable stages running in a single TM JVM.
>
> This where a 1-to-1 relationship between TM and SDK harness can lead to a
> bottleneck (all task slots of a single TM push their work to a single SDK
> container).
>
> #2 in a deployment where multiple pipelines share a Flink cluster, the SDK
> harness per TM approach wouldn't work logically. We would need to have
> multiple SDK containers, not just for efficiency reasons.
>
> This would not be an issue for the deployment scenario I'm looking at, but
> it needs to be considered for general Flink runner deployment.
>
> Regarding the assignment of fixed endpoints within the TM, that is
> possible but it doesn't address #1 and #2.
>
> I hope this clarifies?
>
> Thanks,
> Thomas
>
>
> On Wed, Jun 6, 2018 at 12:31 PM, Henning Rohde  wrote:
>
>> Thanks for writing down and explaining the problem, Thomas. Let me try to
>> tease some of the topics apart.
>>
>> First, the basic setup is currently as follows: there are 2 worker
>> processes (A) "SDK harness" and (B) "Runner harness" that needs to
>> communicate. A connects to B. The fundamental endpoint(s) of B as well as
>> an id -- logging, provisioning, artifacts and control -- are provided to A
>> via command line parameters. A is not expected to be able to connect to the
>> control port without first obtaining pipeline options (from provisioning)
>> and staged files (from artifacts). As an side, this is where the separate
>> boot.go code comes in handy. A can assume it will be restarted, if it
>> exits. A does not assume the given endpoints are up when started and should
>> make blocking calls with timeout (but if not and exits, it is restarted
>> anyway and will retry). Note that the data plane endpoints are part of the
>> control instructions and need not be known or allocated at startup or even
>> be served by the same TM.
>>
>> Second, whether or not docker is used is rather an implementation detail,
>> but if we use Kubernetes (or other such options) then some constraints come
>> into play.
>>
>> Either way, two scenarios work well:
>>    (1) B starts A: The ULR and Flink prototype do this. B will delay
>> starting A until it has decided which endpoints to use. This approach
>> requires B to do process/container management, which we'd rather not have
>> to do at scale. But it's convenient for local runners.
>>(2) B has its (local) endpoints configured or fixed: A and B can be
>> started concurrently. Dataflow does this. Kubernetes lends itself well to
>> this approach (and handles container management for us).
>>
>> The Flink on Kubernetes scenario described above doesn't:
>>(3) B must use randomized (local) endpoints _and_ A and B are started
>> concurrently: A would not know where to connect.
>>
>> Perhaps I'm not understanding the constraints of the TM well enough, but
>> can we really not open a configured/fixed port from the TM -- especially in
>> a network-isolated Kubernetes pod? Adding a third process (C) "proxy" to
>> the pod might be an alternative option and morph (3) into (2). B would
>> configure C when it's ready. A would connect to C, but be blocked until B
>> has configured it. C could perhaps even serve logging, provisioning, and
>> artifacts without B. And the data plane would not go over C anyway. If
>> control proxy'ing is a concern, then alternatively we would add an
>> indirection to the container contract and provide the control endpoint in
>> the provisioning a

Re: SDK Harness Deployment

2018-06-06 Thread Henning Rohde
Thanks for writing down and explaining the problem, Thomas. Let me try to
tease some of the topics apart.

First, the basic setup is currently as follows: there are 2 worker
processes (A) "SDK harness" and (B) "Runner harness" that needs to
communicate. A connects to B. The fundamental endpoint(s) of B as well as
an id -- logging, provisioning, artifacts and control -- are provided to A
via command line parameters. A is not expected to be able to connect to the
control port without first obtaining pipeline options (from provisioning)
and staged files (from artifacts). As an aside, this is where the separate
boot.go code comes in handy. A can assume it will be restarted, if it
exits. A does not assume the given endpoints are up when started and should
make blocking calls with timeout (but if not and exits, it is restarted
anyway and will retry). Note that the data plane endpoints are part of the
control instructions and need not be known or allocated at startup or even
be served by the same TM.
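
A rough Go sketch of that boot-code contract -- block (with a timeout) on the provisioning service before touching the control endpoint, and exit on failure so the environment restarts the process. The flag names follow the container contract; everything after the dials is elided:

package main

import (
    "context"
    "flag"
    "log"
    "time"

    "google.golang.org/grpc"
)

var (
    id                = flag.String("id", "", "Local identifier (required)")
    provisionEndpoint = flag.String("provision_endpoint", "", "Provision endpoint (required)")
    controlEndpoint   = flag.String("control_endpoint", "", "Control endpoint (required)")
)

func dial(ctx context.Context, endpoint string) *grpc.ClientConn {
    ctx, cancel := context.WithTimeout(ctx, 2*time.Minute) // cf. the 2 min window
    defer cancel()
    conn, err := grpc.DialContext(ctx, endpoint, grpc.WithInsecure(), grpc.WithBlock())
    if err != nil {
        log.Fatalf("failed to connect to %v: %v", endpoint, err) // exit; we get restarted
    }
    return conn
}

func main() {
    flag.Parse()
    ctx := context.Background()
    pc := dial(ctx, *provisionEndpoint) // 1) pipeline options, job info, artifacts
    _ = pc                              // ... fetch ProvisionInfo, stage files ...
    cc := dial(ctx, *controlEndpoint)   // 2) only then connect for work
    _ = cc
}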

Second, whether or not docker is used is rather an implementation detail,
but if we use Kubernetes (or other such options) then some constraints come
into play.

Either way, two scenarios work well:
   (1) B starts A: The ULR and Flink prototype do this. B will delay
starting A until it has decided which endpoints to use. This approach
requires B to do process/container management, which we'd rather not have
to do at scale. But it's convenient for local runners.
   (2) B has its (local) endpoints configured or fixed: A and B can be
started concurrently. Dataflow does this. Kubernetes lends itself well to
this approach (and handles container management for us).

The Flink on Kubernetes scenario described above doesn't:
   (3) B must use randomized (local) endpoints _and_ A and B are started
concurrently: A would not know where to connect.

Perhaps I'm not understanding the constraints of the TM well enough, but
can we really not open a configured/fixed port from the TM -- especially in
a network-isolated Kubernetes pod? Adding a third process (C) "proxy" to
the pod might be an alternative option and morph (3) into (2). B would
configure C when it's ready. A would connect to C, but be blocked until B
has configured it. C could perhaps even serve logging, provisioning, and
artifacts without B. And the data plane would not go over C anyway. If
control proxy'ing is a concern, then alternatively we would add an
indirection to the container contract and provide the control endpoint in
the provisioning api, say, or even a new discovery service.

There are of course other options and tradeoffs, but having Flink work on
Kubernetes and not go against the grain seems desirable to me.

Thanks,
 Henning


On Wed, Jun 6, 2018 at 10:11 AM Thomas Weise  wrote:

> Hi,
>
> The current plan for running the SDK harness is to execute docker to
> launch SDK containers with service endpoints provided by the runner in the
> docker command line.
>
> In the case of Flink runner (prototype), the service endpoints are
> dynamically allocated per executable stage. There is typically one Flink
> task manager running per machine. Each TM has multiple task slots. A subset
> of these task slots will run the Beam executable stages. Flink allows
> multiple jobs in one TM, so we could have executable stages of different
> pipelines running in a single TM, depending on how users deploy. The
> prototype also has no cleanup for the SDK containers; they remain running
> and orphaned once the runner is gone.
>
> I'm trying to find out how this approach can be augmented for deployment
> on Kubernetes. Our deployments won't allow multiple jobs per task manager,
> so all task slots will belong to the same pipeline context. The intent is
> to deploy SDK harness containers along with TMs in the same pod. No
> assumption can be made about the order in which the containers are started,
> and the SDK container wouldn't know the connect address at startup (it can
> only be discovered after the pipeline gets deployed into the TMs).
>
> I talked about that a while ago with Henning and one idea was to set a
> fixed endpoint address so that the boot code in the SDK container knows
> upfront where to connect to, even when that endpoint isn't available yet.
> This approach may work with minimal changes to runner and little or no
> change to SDK container (as long as the SDK is prepared to retry). The
> downside is that all (parallel) task slots of the TM will use the same SDK
> worker, which will likely lead to performance issues, at least with the
> Python SDK that we are planning to use.
>
> An alternative may be to define an SDK worker pool per pod, with a
> discovery mechanism for workers to find the runner endpoints and a
> coordination mechanism that distributes the dynamically allocated endpoints
> that are provided by the executable stage task slots over the available
> workers.
>
> Any thoughts on this? Is anyone else looking at a docker free deployment?
>
> 

Re: Go SDK Example

2018-06-04 Thread Henning Rohde
Welcome James!

Awesome that you're interested in contributing to Apache Beam! If you're
specifically interested in the Go SDK, the task you identified is a good
one to start with. I assigned it to you. I also added a few similar tasks
listed below as alternatives. Feel free to pick the one you prefer and
re-assign as appropriate (or I can do it for you). It's best that the JIRAs
are assigned before any work is done to avoid accidental duplication.

  BEAM-4466 Add Go TF-IDF example
  BEAM-4467 Add Go Autocomplete example

The main caveat for Go streaming pipelines is that they currently only
really work on Dataflow, because the only streaming IO connector is PubSub
and the direct runner supports batch only. In the near future, however, the
ULR and Flink runner will support portable streaming pipelines, including
Go. If it is too impractical to work with the IO used by corresponding
Java/Python examples, feel free to deviate by using textio or similar
instead. There may also be incomplete feature work in Go that prevents a
direct translation.
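
For reference, the overall shape of a small batch pipeline in Go looks roughly like this (a sketch adapted from the wordcount-style examples; input/output paths are placeholders and details are elided):

package main

import (
    "context"
    "flag"
    "fmt"
    "strings"

    "github.com/apache/beam/sdks/go/pkg/beam"
    "github.com/apache/beam/sdks/go/pkg/beam/io/textio"
    "github.com/apache/beam/sdks/go/pkg/beam/transforms/stats"
    "github.com/apache/beam/sdks/go/pkg/beam/x/beamx"
)

func main() {
    flag.Parse()
    beam.Init()

    p := beam.NewPipeline()
    s := p.Root()

    // Read lines, split into words, count, format, and write out.
    lines := textio.Read(s, "gs://apache-beam-samples/shakespeare/kinglear.txt")
    words := beam.ParDo(s, func(line string, emit func(string)) {
        for _, w := range strings.Fields(line) {
            emit(w)
        }
    }, lines)
    counted := stats.Count(s, words) // PCollection<KV<string,int>>
    formatted := beam.ParDo(s, func(w string, c int) string {
        return fmt.Sprintf("%s: %v", w, c)
    }, counted)
    textio.Write(s, "/tmp/counts.txt", formatted)

    if err := beamx.Run(context.Background(), p); err != nil {
        panic(err)
    }
}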

Please feel free to ask questions in the JIRAs or on the dev list. Happy to help!

Henning



On Sun, Jun 3, 2018 at 6:41 PM Kenneth Knowles  wrote:

> Hi James,
>
> Welcome!
>
> Have you subscribed to dev@beam.apache.org? I am including that list
> here, since that is the most active list for discussing contributions. I've
> also included Henning explicitly. He is the best person to answer.
>
> I found your JIRA account and set up permissions so you can be assigned
> issues.
>
> Kenn
>
> On Sun, Jun 3, 2018 at 12:35 PM James Wilson  wrote:
>
>> Hi All,
>>
>> This is first time I am trying to contribute to a large open source
>> project.  I was going to tackle the BEAM-4292 "Add streaming word count
>> example" for the Go SDK.  Do I assign it to myself or just complete the
>> task and create a PR request?  I read through the contributing page on the
>> Apache Beam site, but it didn’t go into how to tackle your first task.  Any
>> help would be appreciated.
>>
>> Best,
>> James
>
>


Re: [VOTE] Code Review Process

2018-06-01 Thread Henning Rohde
+1

On Fri, Jun 1, 2018 at 12:27 PM Dan Halperin  wrote:

> +1 -- this is encoding what I previously thought the process was and what,
> in practice, I think was often the behavior of committers anyway.
>
> On Fri, Jun 1, 2018 at 12:21 PM, Yifan Zou  wrote:
>
>> +1
>>
>> On Fri, Jun 1, 2018 at 12:10 PM Robert Bradshaw 
>> wrote:
>>
>>> +1
>>>
>>> On Fri, Jun 1, 2018 at 12:06 PM Chamikara Jayalath 
>>> wrote:
>>>
 +1

 Thanks,
 Cham

 On Fri, Jun 1, 2018 at 11:36 AM Jason Kuster 
 wrote:

> +1
>
> On Fri, Jun 1, 2018 at 11:36 AM Ankur Goenka 
> wrote:
>
>> +1
>>
>> On Fri, Jun 1, 2018 at 11:28 AM Charles Chen  wrote:
>>
>>> +1
>>>
>>> On Fri, Jun 1, 2018 at 11:20 AM Valentyn Tymofieiev <
>>> valen...@google.com> wrote:
>>>
 +1

 On Fri, Jun 1, 2018 at 10:40 AM, Ahmet Altay 
 wrote:

> +1
>
> On Fri, Jun 1, 2018 at 10:37 AM, Kenneth Knowles 
> wrote:
>
>> +1
>>
>> On Fri, Jun 1, 2018 at 10:25 AM Thomas Groh 
>> wrote:
>>
>>> As we seem to largely have consensus in "Reducing Committer Load
>>> for Code Reviews"[1], this is a vote to change the Beam policy on 
>>> Code
>>> Reviews to require that
>>>
>>> (1) At least one committer is involved with the code review, as
>>> either a reviewer or as the author
>>> (2) A contributor has approved the change
>>>
>>> prior to merging any change.
>>>
>>> This changes our policy from its current requirement that at
>>> least one committer *who is not the author* has approved the change 
>>> prior
>>> to merging. We believe that changing this process will improve code 
>>> review
>>> throughput, reduce committer load, and engage more of the community 
>>> in the
>>> code review process.
>>>
>>> Please vote:
>>> [ ] +1: Accept the above proposal to change the Beam code
>>> review/merge policy
>>> [ ] -1: Leave the Code Review policy unchanged
>>>
>>> Thanks,
>>>
>>> Thomas
>>>
>>> [1]
>>> https://lists.apache.org/thread.html/7c1fde3884fbefacc252b6d4b434f9a9c2cf024f381654aa3e47df18@%3Cdev.beam.apache.org%3E
>>>
>>
>

>
> --
> ---
> Jason Kuster
> Apache Beam / Google Cloud Dataflow
>
> See something? Say something. go/jasonkuster-feedback
> 
>

>


Re: [VOTE] Use probot/stale to automatically manage stale pull requests

2018-06-01 Thread Henning Rohde
+1

On Fri, Jun 1, 2018 at 10:16 AM Chamikara Jayalath 
wrote:

> +1 (non-binding).
>
> Thanks,
> Cham
>
> On Fri, Jun 1, 2018 at 10:05 AM Kenneth Knowles  wrote:
>
>> +1
>>
>> On Fri, Jun 1, 2018 at 9:54 AM Scott Wegner  wrote:
>>
>>> +1 (non-binding)
>>>
>>> On Fri, Jun 1, 2018 at 9:39 AM Ahmet Altay  wrote:
>>>
 +1

 On Fri, Jun 1, 2018, 9:32 AM Jason Kuster 
 wrote:

> +1 (non-binding): automating policy ensures it is applied fairly and
> evenly and lessens the load on project maintainers; hearty agreement.
>
> On Fri, Jun 1, 2018 at 9:25 AM Alan Myrvold 
> wrote:
>
>> +1 (non-binding) I updated the pull request to be 60 days (instead of
>> 90) to match the contribute policy.
>>
>> On Fri, Jun 1, 2018 at 9:21 AM Kenneth Knowles 
>> wrote:
>>
>>> Hi all,
>>>
>>> Following the discussion, please vote on the move to activate
>>> probot/stale [3] to notify authors of stale PRs per current policy
>>> and then close them after a 7 day grace period.
>>>
>>> For more details, see:
>>>
>>>  - our stale PR policy [1]
>>>  - the discussion thread [2]
>>>  - Probot stale [3]
>>>  - BEAM ticket summarizing discussion [4]
>>>  - INFRA ticket to activate probot/stale [5]
>>>  - Example PR that would activate it [6]
>>>
>>> Please vote:
>>> [ ] +1, Approve that we activate probot/stale
>>> [ ] -1, Do not approve (please provide specific comments)
>>>
>>> Kenn
>>>
>>> [1] https://beam.apache.org/contribute/#stale-pull-requests
>>> [2]
>>> https://lists.apache.org/thread.html/bda552ea7073ca165aaf47034610afafe22d589e386525023d33609e@%3Cdev.beam.apache.org%3E
>>> [3] https://github.com/probot/stale
>>> [4] https://issues.apache.org/jira/browse/BEAM-4423
>>> [5] https://issues.apache.org/jira/browse/INFRA-16589
>>> [6] https://github.com/apache/beam/pull/5532
>>>
>>
>
> --
> ---
> Jason Kuster
> Apache Beam / Google Cloud Dataflow
>
> See something? Say something. go/jasonkuster-feedback
> 
>



Re: [ANNOUNCEMENT] New committers, May 2018 edition!

2018-06-01 Thread Henning Rohde
Congratulations!

On Fri, Jun 1, 2018 at 7:03 AM Jesse Anderson 
wrote:

> Welcome!
>
> On Fri, Jun 1, 2018, 2:02 AM Etienne Chauchot 
> wrote:
>
>> Congrats to all !
>> On Thursday, May 31, 2018 at 19:08 -0700, Davor Bonaci wrote:
>>
>> Please join me and the rest of Beam PMC in welcoming the following
>> contributors as our newest committers. They have significantly contributed
>> to the project in different ways, and we look forward to many more
>> contributions in the future.
>>
>> * Griselda Cuevas
>> * Pablo Estrada
>> * Jason Kuster
>>
>> (Apologies for the delayed announcement, and the lack of the usual
>> paragraph summarizing individual contributions.)
>>
>> Congratulations to all three! Welcome!
>>
>>


Re: [VOTE] Go SDK

2018-06-01 Thread Henning Rohde
Thanks, Davor!

On Thu, May 31, 2018 at 7:10 PM Davor Bonaci  wrote:

> The IP clearance document has been filed into Foundation records, and is
> currently under review. No further action necessary, unless we hear back.
>
> On Fri, May 25, 2018 at 10:31 AM, Henning Rohde 
> wrote:
>
>> Thanks a lot, Davor! Much appreciated.
>>
>> Thanks,
>>  Henning
>>
>> On Fri, May 25, 2018 at 10:26 AM Davor Bonaci  wrote:
>>
>>> ETA: weekend.
>>>
>>> On Fri, May 25, 2018 at 9:35 AM Henning Rohde 
>>> wrote:
>>>
>>>> RESULT: the vote passed with only +1s! Thank you all for the kind
>>>> comments.
>>>>
>>>> The only pending item is the IP clearance form (draft:
>>>> https://web.tresorit.com/l#nUkKlgi3cBYxYAOyhCMXIw).
>>>> Are there any ASF members who can help getting it recorded?
>>>>
>>>> Thanks,
>>>>  Henning
>>>>
>>>>
>>>> On Wed, May 23, 2018 at 2:45 PM Henning Rohde 
>>>> wrote:
>>>>
>>>>> Thanks Davor! I filled out the form to the best of my ability and
>>>>> placed it here (avoiding attachments on the list):
>>>>>
>>>>> https://web.tresorit.com/l#nUkKlgi3cBYxYAOyhCMXIw
>>>>>
>>>>> Please take a look and let me know if you need anything more from me.
>>>>>
>>>>> Thanks,
>>>>>  Henning
>>>>>
>>>>> On Wed, May 23, 2018 at 8:51 AM Thomas Groh  wrote:
>>>>>
>>>>>> +1!
>>>>>>
>>>>>> I, for one, could not be more excited about our glorious portable
>>>>>> future.
>>>>>>
>>>>>> On Mon, May 21, 2018 at 6:03 PM Henning Rohde 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> Now that the remaining issues have been resolved as discussed, I'd
>>>>>>> like to propose a formal vote on accepting the Go SDK into master. The 
>>>>>>> main
>>>>>>> practical difference is that the Go SDK would be part of the Apache Beam
>>>>>>> release going forward.
>>>>>>>
>>>>>>> Highlights of the Go SDK:
>>>>>>>  * Go user experience with natively-typed DoFns with (simulated)
>>>>>>> generic types
>>>>>>>  * Covers most of the Beam model: ParDo, GBK, CoGBK, Flatten,
>>>>>>> Combine, Windowing, ..
>>>>>>>  * Includes several IO connectors: Datastore, BigQuery, PubSub,
>>>>>>> extensible textio.
>>>>>>>  * Supports the portability framework for both batch and streaming,
>>>>>>> notably the upcoming portable Flink runner
>>>>>>>  * Supports a direct runner for small batch workloads and testing.
>>>>>>>  * Includes pre-commit tests and post-commit integration tests.
>>>>>>>
>>>>>>> And last but not least
>>>>>>>  *  includes contributions from several independent users and
>>>>>>> developers, notably an IO connector for Datastore!
>>>>>>>
>>>>>>> Website: https://beam.apache.org/documentation/sdks/go/
>>>>>>> Code: https://github.com/apache/beam/tree/master/sdks/go
>>>>>>> Design: https://s.apache.org/beam-go-sdk-design-rfc
>>>>>>>
>>>>>>> Please vote:
>>>>>>> [ ] +1, Approve that the Go SDK becomes an official part of Beam
>>>>>>> [ ] -1, Do not approve (please provide specific comments)
>>>>>>>
>>>>>>> Thanks,
>>>>>>>  The Gophers of Apache Beam
>>>>>>>
>>>>>>>
>>>>>>>
>


Re: Reducing Committer Load for Code Reviews

2018-05-31 Thread Henning Rohde
+1

On Thu, May 31, 2018 at 8:55 AM Thomas Weise  wrote:

> +1 to the goal of increasing review bandwidth
>
> In addition to the proposed reviewer requirement change, perhaps there are
> other ways to contribute towards that goal as well?
>
> The discussion so far has focused on how more work can get done with the
> same pool of committers or how committers can get their work done faster.
> But ASF is really about "community over code" and in that spirit maybe we
> can consider how community growth can lead to similar effects? One way I
> can think of is that besides code contributions existing committers and
> especially the PMC members can help more towards growing the committer
> base, by mentoring contributors and helping them with their contributions
> and learning the ASF way of doing things. That seems a way to scale the
> project in the long run.
>
> I'm not super excited about the concepts of "owner" and "maintainer" often
> found in (non ASF) projects like Kenn mentions. Depending on the exact
> interpretation, these have the potential of establishing an artificial
> barrier and limiting growth/sustainability in the contributor base. Such
> powers tend to be based on historical accomplishments vs. current situation.
>
> Thanks,
> Thomas
>
>
> On Thu, May 31, 2018 at 7:35 AM, Etienne Chauchot 
> wrote:
>
>> On Thursday, May 31, 2018 at 06:17 -0700, Robert Burke wrote:
>>
>> +1 I also thought this was the norm.
>>
>>  My read of the committer/contributor guide was that a committer couldn't
>> unilaterally merge their own code (approval/LGTM needs to come from
>> someone familiar with the component), rather than every review needing two
>> committers. I don't recall a requirement that each PR have two committers
>> attached, which I agree is burdensome especially for new contributors.
>>
>> Yes me too, I thought exactly the same
>>
>>
>> On Wed, May 30, 2018, 2:23 PM Udi Meiri  wrote:
>>
>> I thought this was the norm already? I have been the sole reviewer on a few
>> PRs by committers and I'm only a contributor.
>>
>> +1
>>
>> On Wed, May 30, 2018 at 2:13 PM Kenneth Knowles  wrote:
>>
>> ++1
>>
>> This is good reasoning. If you trust someone with the committer
>> responsibilities [1] you should trust them to find an appropriate reviewer.
>>
>> Also:
>>
>>  - adds a new way for non-committers and committers to bond
>>  - makes committers seem less like gatekeepers because it goes both ways
>>  - might help clear PR backlog, improving our community response latency
>>  - encourages committers to code*
>>
>> Kenn
>>
>> [1] https://beam.apache.org/contribute/become-a-committer/
>>
>> *With today's system, if a committer and a few non-committers are working
>> together, then when the committer writes code it is harder to get it merged
>> because it takes an extra committer. It is easier to have non-committers
>> write all the code and the committer just does reviews. It is 1 committer
>> vs 2 being involved. This used to be fine when almost everyone was a
>> committer and all working on the core, but it is not fine any more.
>>
>> On Wed, May 30, 2018 at 12:50 PM Thomas Groh  wrote:
>>
>> Hey all;
>>
>> I've been thinking recently about the process we have for committing
>> code, and our current process. I'd like to propose that we change our
>> current process to require at least one committer is present for each code
>> review, but remove the need to have a second committer review the code
>> prior to submission if the original contributor is a committer.
>>
>> Generally, if we trust someone with the ability to merge code that
>> someone else has written, I think it's sensible to also trust them to
>> choose a capable reviewer. We expect that all of the people that we have
>> recognized as committers will maintain the project's quality bar - and
>> that's true for both code they author and code they review. Given that, I
>> think it's sensible to expect a committer will choose a reviewer who is
>> versed in the component they are contributing to who can provide insight
>> and will also hold up the quality bar.
>>
>> Making this change will help spread the review load out among regular
>> contributors to the project, and reduce bottlenecks caused by committers
>> who have few other committers working on their same component. Obviously,
>> this requires that committers act with the best interests of the project
>> when they send out their code for reviews - but this is the behavior we
>> demand before someone is recognized as a committer, so I don't see why that
>> would be cause for concern.
>>
>> Yours,
>>
>> Thomas
>>
>>
>


Re: Go build failure

2018-05-28 Thread Henning Rohde
Hi Colm,

 The gradle plugin installs Go for you. I tried "./gradlew check" on my
machine: the error is "go vet" itself failing due to analyzing our
dependencies under vendor:

/Users/herohde/go/src/github.com/apache/beam/sdks/go/test/vendor/github.com/coreos/etcd/tools/etcd-test-proxy/main.go:47: Fprintln arg list ends with redundant newline

/Users/herohde/go/src/github.com/apache/beam/sdks/go/test/vendor/github.com/dgrijalva/jwt-go/errors.go:54: unreachable code

/Users/herohde/go/src/github.com/apache/beam/sdks/go/test/vendor/github.com/ghodss/yaml/yaml.go:276: unreachable code
[...]

Not sure whether this is something we can fix in our configuration or
whether it's rather a bug in the gogradle plugin. Given that there are no
go vet problems in our code, we'd want that check to pass. Feel free to
open a bug and I can take a deeper look at it.

For now, it seems you'd have to exclude the go targets for "./gradlew
check". Note that if you just want to build the code, you should run
"./gradlew build" which works.

Henning

On Mon, May 28, 2018 at 7:58 AM Jean-Baptiste Onofré 
wrote:

> Hi Colm,
>
> it worked for me (with build) during the weekend.
>
> Do you have golang installed on your machine ?
>
> Regards
> JB
>
> On 28/05/2018 16:50, Colm O hEigeartaigh wrote:
> > Hi all,
> >
> > The following fails for me with "./gradlew check":
> >
> > Execution failed for task ':beam-sdks-go:vet'.
> >> Build failed due to return code 1 of:
> >   Command:
> >/home/coheig/.gradle/go/binary/1.10/go/bin/go tool vet
> > /home/coheig/src/apache/beam/sdks/go/test
> > /home/coheig/src/apache/beam/sdks/go/pkg
> > /home/coheig/src/apache/beam/sdks/go/data
> > /home/coheig/src/apache/beam/sdks/go/cmd
> > /home/coheig/src/apache/beam/sdks/go/build
> > /home/coheig/src/apache/beam/sdks/go/examples
> > /home/coheig/src/apache/beam/sdks/go/container
> >   Env:
> >GOEXE=
> >GOPATH=/home/coheig/src/apache/beam/sdks/go/.gogradle/project_gopath
> >GOROOT=/home/coheig/.gradle/go/binary/1.10/go
> >GOOS=linux
> >GOARCH=amd64
> >
> > Should the gradle configuration be setting up Go, or is it a requirement
> > to install it separately? Or is this failure caused by something else?
> >
> > Thanks,
> >
> > Colm.
> >
> >
> > --
> > Colm O hEigeartaigh
> >
> > Talend Community Coder
> > http://coders.talend.com
>
> --
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [VOTE] Go SDK

2018-05-25 Thread Henning Rohde
Thanks a lot, Davor! Much appreciated.

Thanks,
 Henning

On Fri, May 25, 2018 at 10:26 AM Davor Bonaci <da...@apache.org> wrote:

> ETA: weekend.
>
> On Fri, May 25, 2018 at 9:35 AM Henning Rohde <hero...@google.com> wrote:
>
>> RESULT: the vote passed with only +1s! Thank you all for the kind
>> comments.
>>
>> The only pending item is the IP clearance form (draft:
>> https://web.tresorit.com/l#nUkKlgi3cBYxYAOyhCMXIw).
>> Are there any ASF members who can help getting it recorded?
>>
>> Thanks,
>>  Henning
>>
>>
>> On Wed, May 23, 2018 at 2:45 PM Henning Rohde <hero...@google.com> wrote:
>>
>>> Thanks Davor! I filled out the form to the best of my ability and placed
>>> it here (avoiding attachments on the list):
>>>
>>> https://web.tresorit.com/l#nUkKlgi3cBYxYAOyhCMXIw
>>>
>>> Please take a look and let me know if you need anything more from me.
>>>
>>> Thanks,
>>>  Henning
>>>
>>> On Wed, May 23, 2018 at 8:51 AM Thomas Groh <tg...@google.com> wrote:
>>>
>>>> +1!
>>>>
>>>> I, for one, could not be more excited about our glorious portable
>>>> future.
>>>>
>>>> On Mon, May 21, 2018 at 6:03 PM Henning Rohde <hero...@google.com>
>>>> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> Now that the remaining issues have been resolved as discussed, I'd
>>>>> like to propose a formal vote on accepting the Go SDK into master. The 
>>>>> main
>>>>> practical difference is that the Go SDK would be part of the Apache Beam
>>>>> release going forward.
>>>>>
>>>>> Highlights of the Go SDK:
>>>>>  * Go user experience with natively-typed DoFns with (simulated)
>>>>> generic types
>>>>>  * Covers most of the Beam model: ParDo, GBK, CoGBK, Flatten, Combine,
>>>>> Windowing, ..
>>>>>  * Includes several IO connectors: Datastore, BigQuery, PubSub,
>>>>> extensible textio.
>>>>>  * Supports the portability framework for both batch and streaming,
>>>>> notably the upcoming portable Flink runner
>>>>>  * Supports a direct runner for small batch workloads and testing.
>>>>>  * Includes pre-commit tests and post-commit integration tests.
>>>>>
>>>>> And last but not least
>>>>>  *  includes contributions from several independent users and
>>>>> developers, notably an IO connector for Datastore!
>>>>>
>>>>> Website: https://beam.apache.org/documentation/sdks/go/
>>>>> Code: https://github.com/apache/beam/tree/master/sdks/go
>>>>> Design: https://s.apache.org/beam-go-sdk-design-rfc
>>>>>
>>>>> Please vote:
>>>>> [ ] +1, Approve that the Go SDK becomes an official part of Beam
>>>>> [ ] -1, Do not approve (please provide specific comments)
>>>>>
>>>>> Thanks,
>>>>>  The Gophers of Apache Beam
>>>>>
>>>>>
>>>>>


Re: [VOTE] Go SDK

2018-05-25 Thread Henning Rohde
RESULT: the vote passed with only +1s! Thank you all for the kind comments.

The only pending item is the IP clearance form (draft:
https://web.tresorit.com/l#nUkKlgi3cBYxYAOyhCMXIw).
Are there any ASF members who can help getting it recorded?

Thanks,
 Henning


On Wed, May 23, 2018 at 2:45 PM Henning Rohde <hero...@google.com> wrote:

> Thanks Davor! I filled out the form to the best of my ability and placed
> it here (avoiding attachments on the list):
>
> https://web.tresorit.com/l#nUkKlgi3cBYxYAOyhCMXIw
>
> Please take a look and let me know if you need anything more from me.
>
> Thanks,
>  Henning
>
> On Wed, May 23, 2018 at 8:51 AM Thomas Groh <tg...@google.com> wrote:
>
>> +1!
>>
>> I, for one, could not be more excited about our glorious portable future.
>>
>> On Mon, May 21, 2018 at 6:03 PM Henning Rohde <hero...@google.com> wrote:
>>
>>> Hi everyone,
>>>
>>> Now that the remaining issues have been resolved as discussed, I'd like
>>> to propose a formal vote on accepting the Go SDK into master. The main
>>> practical difference is that the Go SDK would be part of the Apache Beam
>>> release going forward.
>>>
>>> Highlights of the Go SDK:
>>>  * Go user experience with natively-typed DoFns with (simulated)
>>> generic types
>>>  * Covers most of the Beam model: ParDo, GBK, CoGBK, Flatten, Combine,
>>> Windowing, ..
>>>  * Includes several IO connectors: Datastore, BigQuery, PubSub,
>>> extensible textio.
>>>  * Supports the portability framework for both batch and streaming,
>>> notably the upcoming portable Flink runner
>>>  * Supports a direct runner for small batch workloads and testing.
>>>  * Includes pre-commit tests and post-commit integration tests.
>>>
>>> And last but not least
>>>  *  includes contributions from several independent users and
>>> developers, notably an IO connector for Datastore!
>>>
>>> Website: https://beam.apache.org/documentation/sdks/go/
>>> Code: https://github.com/apache/beam/tree/master/sdks/go
>>> Design: https://s.apache.org/beam-go-sdk-design-rfc
>>>
>>> Please vote:
>>> [ ] +1, Approve that the Go SDK becomes an official part of Beam
>>> [ ] -1, Do not approve (please provide specific comments)
>>>
>>> Thanks,
>>>  The Gophers of Apache Beam
>>>
>>>
>>>


Re: [VOTE] Go SDK

2018-05-23 Thread Henning Rohde
Thanks Davor! I filled out the form to the best of my ability and placed it
here (avoiding attachments on the list):

https://web.tresorit.com/l#nUkKlgi3cBYxYAOyhCMXIw

Please take a look and let me know if you need anything more from me.

Thanks,
 Henning

On Wed, May 23, 2018 at 8:51 AM Thomas Groh <tg...@google.com> wrote:

> +1!
>
> I, for one, could not be more excited about our glorious portable future.
>
> On Mon, May 21, 2018 at 6:03 PM Henning Rohde <hero...@google.com> wrote:
>
>> Hi everyone,
>>
>> Now that the remaining issues have been resolved as discussed, I'd like
>> to propose a formal vote on accepting the Go SDK into master. The main
>> practical difference is that the Go SDK would be part of the Apache Beam
>> release going forward.
>>
>> Highlights of the Go SDK:
>>  * Go user experience with natively-typed DoFns with (simulated) generic
>> types
>>  * Covers most of the Beam model: ParDo, GBK, CoGBK, Flatten, Combine,
>> Windowing, ..
>>  * Includes several IO connectors: Datastore, BigQuery, PubSub,
>> extensible textio.
>>  * Supports the portability framework for both batch and streaming,
>> notably the upcoming portable Flink runner
>>  * Supports a direct runner for small batch workloads and testing.
>>  * Includes pre-commit tests and post-commit integration tests.
>>
>> And last but not least
>>  *  includes contributions from several independent users and developers,
>> notably an IO connector for Datastore!
>>
>> Website: https://beam.apache.org/documentation/sdks/go/
>> Code: https://github.com/apache/beam/tree/master/sdks/go
>> Design: https://s.apache.org/beam-go-sdk-design-rfc
>>
>> Please vote:
>> [ ] +1, Approve that the Go SDK becomes an official part of Beam
>> [ ] -1, Do not approve (please provide specific comments)
>>
>> Thanks,
>>  The Gophers of Apache Beam
>>
>>
>>


Re: [VOTE] Go SDK

2018-05-21 Thread Henning Rohde
Thanks everyone!

Davor -- regarding your two comments:
  * Robert mentioned that "SGA should have probably already been filed" in
the previous thread. I got the impression that nothing further was needed.
I'll follow up.
  * The standard Go tooling basically always pulls directly from github, so
there is no real urgency here.

Thanks,
 Henning


On Mon, May 21, 2018 at 9:30 PM Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> +1 (binding)
>
> I just want to check about SGA/IP/Headers.
>
> Thanks !
> Regards
> JB
>
> On 22/05/2018 03:02, Henning Rohde wrote:
> > Hi everyone,
> >
> > Now that the remaining issues have been resolved as discussed, I'd like
> > to propose a formal vote on accepting the Go SDK into master. The main
> > practical difference is that the Go SDK would be part of the Apache Beam
> > release going forward.
> >
> > Highlights of the Go SDK:
> >   * Go user experience with natively-typed DoFns with (simulated)
> > generic types
> >   * Covers most of the Beam model: ParDo, GBK, CoGBK, Flatten, Combine,
> > Windowing, ..
> >   * Includes several IO connectors: Datastore, BigQuery, PubSub,
> > extensible textio.
> >   * Supports the portability framework for both batch and streaming,
> > notably the upcoming portable Flink runner
> >   * Supports a direct runner for small batch workloads and testing.
> >   * Includes pre-commit tests and post-commit integration tests.
> >
> > And last but not least
> >   *  includes contributions from several independent users and
> > developers, notably an IO connector for Datastore!
> >
> > Website: https://beam.apache.org/documentation/sdks/go/
> > Code: https://github.com/apache/beam/tree/master/sdks/go
> > Design: https://s.apache.org/beam-go-sdk-design-rfc
> >
> > Please vote:
> > [ ] +1, Approve that the Go SDK becomes an official part of Beam
> > [ ] -1, Do not approve (please provide specific comments)
> >
> > Thanks,
> >   The Gophers of Apache Beam
> >
> >
>


[VOTE] Go SDK

2018-05-21 Thread Henning Rohde
Hi everyone,

Now that the remaining issues have been resolved as discussed, I'd like to
propose a formal vote on accepting the Go SDK into master. The main
practical difference is that the Go SDK would be part of the Apache Beam
release going forward.

Highlights of the Go SDK:
 * Go user experience with natively-typed DoFns with (simulated) generic
types
 * Covers most of the Beam model: ParDo, GBK, CoGBK, Flatten, Combine,
Windowing, ..
 * Includes several IO connectors: Datastore, BigQuery, PubSub, extensible
textio.
 * Supports the portability framework for both batch and streaming, notably
the upcoming portable Flink runner
 * Supports a direct runner for small batch workloads and testing.
 * Includes pre-commit tests and post-commit integration tests.

And last but not least
 *  includes contributions from several independent users and developers,
notably an IO connector for Datastore!

Website: https://beam.apache.org/documentation/sdks/go/
Code: https://github.com/apache/beam/tree/master/sdks/go
Design: https://s.apache.org/beam-go-sdk-design-rfc

Please vote:
[ ] +1, Approve that the Go SDK becomes an official part of Beam
[ ] -1, Do not approve (please provide specific comments)

Thanks,
 The Gophers of Apache Beam


Re: The Go SDK got accidentally merged - options to deal with the pain

2018-05-21 Thread Henning Rohde
Hi everyone,

 Thanks again for your patience. The last remaining Go SDK items are now
resolved and the beam website has been updated! I'll start a separate
thread for the formal vote shortly.

Thanks,
 Henning


On Thu, Apr 19, 2018 at 5:42 PM Henning Rohde <hero...@google.com> wrote:

> Hi everyone,
>
>  Thank you all for your patience. The last major identified feature (Go
> windowing) is now in review: https://github.com/apache/beam/pull/5179.
> The remaining work listed under
>
>  https://issues.apache.org/jira/browse/BEAM-2083
>
> is integration tests and documentation (quickstart, etc). I expect that
> will take a few weeks after which we should be in a position to do a vote
> about making the Go SDK an official Beam SDK. To this end, please do take a
> look at the listed tasks and let me know if there is anything missing.
>
> Lastly, I have a practical question: how should we order the PRs to the
> beam site documentation wrt the vote? Should we get PRs accepted, but not
> committed before a vote? Or just commit them as they are ready to avoid
> potential merge conflicts?
>
> Thanks!
>
> Henning
>
>
>
>
> On Sat, Mar 10, 2018 at 10:45 AM Henning Rohde <hero...@google.com> wrote:
>
>> Thank you all! I've added the remaining work -- as I understand it -- as
>> dependencies to the overall Go SDK issue (tracking the "official" merge to
>> master):
>>
>> https://issues.apache.org/jira/browse/BEAM-2083
>>
>> Please feel free to add to this list or expand the items, if there is
>> anything I overlooked. If the presence of the Go SDK in master causes
>> issues for other modules, please simply file a bug against me and I'll take
>> care of it.
>>
>> Robert - I understand your last reply as addressing Davor's points.
>> Please let me know if there is anything I need to do in that regard.
>>
>> Henning
>>
>>
>>
>> On Fri, Mar 9, 2018 at 8:39 AM, Ismaël Mejía <ieme...@gmail.com> wrote:
>>
>>> +1 to let it evolve in master (+Davor points), having ongoing work on
>>> master makes sense given the state of advance + the hope that this
>>> won't add any issue for the other modules.
>>>
>>> On Thu, Mar 8, 2018 at 7:30 PM, Robert Bradshaw <rober...@google.com>
>>> wrote:
>>> > +1 to both of these points. SGA should have probably already been
>>> filed, and
>>> > excising this from releases should be easy, but I added a line item to
>>> the
>>> > validation checklist template to make sure we don't forget.
>>> >
>>> > On Thu, Mar 8, 2018 at 7:13 AM Davor Bonaci <da...@apache.org> wrote:
>>> >>
>>> >> I support leaving things as they stand now -- thanks for finding a
>>> good
>>> >> way out of an uncomfortable situation.
>>> >>
>>> >> That said, two things need to happen:
>>> >> (1) SGA needs to be filed asap, per Board feedback in the last
>>> report, and
>>> >> (2) releases cannot contain any code from the Go SDK before formally
>>> voted
>>> >> on the new component and accepted. This includes source releases that
>>> are
>>> >> created through "assembly", so manual exclusion in the configuration
>>> is
>>> >> likely needed.
>>> >>
>>> >> On Wed, Mar 7, 2018 at 1:54 PM, Kenneth Knowles <k...@google.com>
>>> wrote:
>>> >>>
>>> >>> Re-reading the old thread, I see these desirata:
>>> >>>
>>> >>>  - "enough IO to write end-to-end examples such as WordCount and
>>> >>> demonstrate what IOs would look like"
>>> >>>  - "accounting and tracking the fact that each element has an
>>> associated
>>> >>> window and timestamp"
>>> >>>  - "test suites and test utilities"
>>> >>>
>>> >>> Browsing the code, it looks like these each exist to some level of
>>> >>> completion.
>>> >>>
>>> >>> Kenn
>>> >>>
>>> >>>
>>> >>> On Wed, Mar 7, 2018 at 1:38 PM Robert Bradshaw <rober...@google.com>
>>> >>> wrote:
>>> >>>>
>>> >>>> I was actually thinking along the same lines: what was yet lacking
>>> to
>>> >>>> "officially" merge the Go branch in? The thread we started on this
>>> seems to

Re: Proposal: keeping precommit times fast

2018-05-18 Thread Henning Rohde
Good proposal. I think it should be considered in tandem with the "No
commit on red post-commit" proposal and could be far more ambitious than 2
hours. For example, something in the <15-20 mins range, say, would be much
less of an inconvenience to the development effort. Go takes ~3 mins, which
means that it is practical to wait until a PR is green before asking anyone
to look at it. If I need to wait for a Java or Python pre-commit, I task
switch and come back later. If the post-commits are enforced to be green,
we could possibly gain a much more productive flow at the cost of the
occasional post-commit break, compared to now. Maybe IOs can be less
extensively tested pre-commit, for example, or only if actually changed?

I also like Robert's suggestion of splitting up pre-commits into something
more fine-grained to get a clear partial signal quicker. If we have an
adequate number of Jenkins slots, it might also speed things up overall.

Thanks,
 Henning

On Fri, May 18, 2018 at 12:30 PM Scott Wegner  wrote:

> re: intelligently skipping tests for code that doesn't change (i.e. Java
> tests on Python PR): this should be possible. We already have build-caching
> enabled in Gradle, but I believe it is local to the git workspace and
> doesn't persist between Jenkins runs.
>
> With a quick search, I see there is a Jenkins Build Cacher Plugin [1] that
> hooks into Gradle build cache and does exactly what we need. Does anybody
> know whether we could get this enabled on our Jenkins?
>
> [1] https://wiki.jenkins.io/display/JENKINS/Job+Cacher+Plugin
>
> On Fri, May 18, 2018 at 12:08 PM Robert Bradshaw 
> wrote:
>
>> [somehow  my email got garbled...]
>>
>> Now that we're using gradle, perhaps we could be more intelligent about
>> only running the affected tests? E.g. when you touch Python (or Go) you
>> shouldn't need to run the Java precommit at all, which would reduce the
>> latency for those PRs and also the time spent in queue. Presumably this
>> could even be applied per-module for the Java tests. (Maybe a large, shared
>> build cache could help here as well...)
>>
>> I also wouldn't be opposed to a quicker immediate signal, plus more
>> extensive tests before actually merging. It's also nice to not have to wait
>> an hour to see that you have a lint error; quick stuff like that could be
>> signaled quickly before a contributor looses context.
>>
>> - Robert
>>
>>
>>
>> On Fri, May 18, 2018 at 5:55 AM Kenneth Knowles  wrote:
>>
>>> I like the idea. I think it is a good time for the project to start
>>> tracking this and keeping it usable.
>>>
>>> Certainly 2 hours is more than enough, is that not so? The Java
>>> precommit seems to take <=40 minutes while Python takes ~20 and Go is so
>>> fast it doesn't matter. Do we have enough stragglers that we don't make
>>> it in the 95th percentile? Is the time spent in the Jenkins queue?
>>>
>>> For our current coverage, I'd be willing to go for:
>>>
>>>  - 1 hr hard cap (someone better at stats could choose %ile)
>>>  - roll back or remove test from precommit if fix looks like more than 1
>>> week (roll back if it is perf degradation, remove test from precommit if it
>>> is additional coverage that just doesn't fit in the time)
>>>
>>> There's a longer-term issue that doing a full build each time is
>>> expected to linearly scale up with the size of our repo (it is the monorepo
>>> problem but for a minirepo) so there is no cap that is feasible until we
>>> have effective cross-build caching. And my long-term goal would be <30
>>> minutes. At the latency of opening a pull request and then checking your
>>> email that's not burdensome, but an hour is.
>>>
>>> Kenn
>>>
>>> On Thu, May 17, 2018 at 6:54 PM Udi Meiri  wrote:
>>>
 HI,
 I have a proposal to improve contributor experience by keeping
 precommit times low.

 I'm looking to get community consensus and approval about:
 1. How long should precommits take. 2 hours @95th percentile over the
 past 4 weeks is the current proposal.
 2. The process for dealing with slowness. Do we: fix, roll back, remove
 a test from precommit?
 Rolling back if a fix is estimated to take longer than 2 weeks is the
 current proposal.


 https://docs.google.com/document/d/1udtvggmS2LTMmdwjEtZCcUQy6aQAiYTI3OrTP8CLfJM/edit?usp=sharing

>>>


Re: Fwd: Closing (automatically?) inactive pull requests

2018-05-14 Thread Henning Rohde
+1

Agree with Robert's sentiment. For timing, I'd suggest a warning after 3
months and closure a month later (a week seems a little tight if it
triggers during vacation/holidays).

On Mon, May 14, 2018 at 2:59 PM Robert Bradshaw  wrote:

> +1
>
> In terms of being empathetic, it might actually be an advantage for an
> action like close to be done automatically rather than feeling like a human
> picked out your PR as being not worth being left open.
> On Mon, May 14, 2018 at 2:42 PM Andrew Pilloud 
> wrote:
>
> > Warnings are really helpful, I've forgotten about PRs on projects I
> rarely contribute to before. Also authors can reopen their closed pull
> requests if they decide they want to work on them again. This seems to be
> already covered in the Stale pull requests section of the contributor
> guide. Seems like you should just make it happen.
>
> > Andrew
>
> > On Mon, May 14, 2018 at 1:26 PM Kenneth Knowles  wrote:
>
> >> Yea, the bot they linked to sends a warning comment first.
>
> >> Kenn
>
> >> On Mon, May 14, 2018 at 7:40 AM Jean-Baptiste Onofré 
> wrote:
>
> >>> Hi,
>
> >>> Do you know if the bot can send a first "warn" comment before closing
> >>> the PR ?
>
> >>> I think that would be great: if the contributor is not active after the
> >>> warn message, then, it's fine to close the PR (the contributor can
> >>> always open a new one later if it makes sense).
>
> >>> Regards
> >>> JB
>
> >>> On 14/05/2018 16:20, Kenneth Knowles wrote:
> >>> > Hi all,
> >>> >
> >>> > Spotted this thread on d...@flink.apache.org
> >>> > . I didn't make a combined thread
> because
> >>> > each project should discuss on our own.
> >>> >
> >>> > I think it would be great to share "stale PR closer bot"
> infrastructure
> >>> > (and this might naturally be a hook where we put other things /
> combine
> >>> > with merge-bot / etc).
> >>> >
> >>> > The downside to automation is being less empathetic - but hopefully
> for
> >>> > very stale PRs no one is really listening anyhow.
> >>> >
> >>> > Kenn
> >>> >
> >>> > -- Forwarded message -
> >>> > From: Ufuk Celebi >
> >>> > Date: Mon, May 14, 2018 at 5:58 AM
> >>> > Subject: Re: Closing (automatically?) inactive pull requests
> >>> > To: >
> >>> >
> >>> >
> >>> > Hey Piotr,
> >>> >
> >>> > thanks for bringing this up. I really like this proposal and also saw
> >>> > it work successfully at other projects. So +1 from my side.
> >>> >
> >>> > - I like the approach with a notification one week before
> >>> > automatically closing the PR
> >>> > - I think a bot will the best option as these kinds of things are
> >>> > usually followed enthusiastically in the beginning but eventually
> >>> > loose traction
> >>> >
> >>> > We can enable better integration with GitHub by using ASF GitBox
> >>> > (https://gitbox.apache.org/setup/) but we should discuss that in a
> >>> > separate thread.
> >>> >
> >>> > – Ufuk
> >>> >
> >>> > On Mon, May 14, 2018 at 12:04 PM, Piotr Nowojski
> >>> > > wrote:
> >>> >  > Hey,
> >>> >  >
> >>> >  > We have lots of open pull requests and quite some of them are
> >>> > stale/abandoned/inactive. Often such old PRs are impossible to merge
> due
> >>> > to conflicts and it’s easier to just abandon and rewrite them.
> >>> > Especially there are some PRs which original contributor created long
> >>> > time ago, someone else wrote some comments/review and… that’s about
> it.
> >>> >  > The original contributor never showed up again to respond to the comments.
> >>> > Regardless of the reason such PRs are clogging the GitHub, making it
> >>> > difficult to keep track of things and making it almost impossible to
> >>> > find a little bit old (for example 3+ months) PRs that are still
> valid
> >>> > and waiting for reviews. To do something like that, one would have to
> >>> > dig through tens or hundreds of abandoned PRs.
> >>> >  >
> >>> >  > What I would like to propose is to agree on some inactivity dead
> >>> > line, lets say 3 months. After crossing such deadline, PRs should be
> >>> > marked/commented as “stale”, with information like:
> >>> >  >
> >>> >  > “This pull request has been marked as stale due to 3 months of
> >>> > inactivity. It will be closed in 1 week if no further activity
> occurs.
> >>> > If you think that’s incorrect or this pull request requires a review,
> >>> > please simply write any comment.”
> >>> >  >
> >>> >  > Either we could just agree on such policy and enforce it manually
> >>> > (maybe with some simple tooling, like a simple script to list
> inactive
> >>> > PRs - seems like couple of lines in python by using PyGithub) or we
> >>> > could think about automating this action. There are some bots that do
> >>> > exactly this (like this one: 

Re: Go SDK integration tests

2018-05-14 Thread Henning Rohde
Thanks Kenn. For now, the idea is just to have something (wordcount!)
running -- but the Go integration test suite should ideally cover the whole
Go implementation. The difference from ValidatesRunner tests is that it
targets SDK harness coverage, not runner coverage, although there is
substantial overlap. In practice, it'll likely serve both purposes.

On Mon, May 14, 2018 at 1:34 PM Kenneth Knowles <k...@google.com> wrote:

> Nice! This is super important. Commented a bit. Any thoughts on the scope
> of tests you want to add? You mention not as extensive as the Java
> ValidatesRunner, which makes sense. So are you mostly targeting the basic
> example ITs as in the example in the doc?
>
> On the other hand, I think a Golang-based ValidatesRunner suite might do
> an even better job of keeping portable runners from accidentally using
> shortcuts.
>
> Kenn
>
> On Tue, May 8, 2018 at 6:14 PM Henning Rohde <hero...@google.com> wrote:
>
>> Hi everyone,
>>
>>  I'm currently tinkering with adding integration tests for Go (BEAM-3827)
>> and wrote down a small proposal to that end:
>>
>>
>> https://docs.google.com/document/d/1jy6EE7D4RjgfNV0FhD3rMsT1YKhnUfcHRZMAlC6ygXw/edit?usp=sharing
>>
>> Similarly to other SDKs, the proposal is to add self-validating
>> integration tests that don't produce output. But unlike Java, we can't
>> easily reuse the Go example code directly and use a driver program with all
>> tests linked in to run against an arbitrary runner.
>>
>> Comments welcome!
>>
>> Thanks,
>>  Henning
>>
>>


Re: Tracking what works with portability

2018-05-14 Thread Henning Rohde
Thanks Thomas! I'll add a link to the status section on the portability
page.

Henning

On Mon, May 14, 2018 at 2:18 PM Thomas Weise <t...@apache.org> wrote:

> Thanks Henning! This spreadsheet is super helpful and long needed.
>
> I like how this is also serving as portability JIRA index, which I found
> extremely hard to navigate till now. How about making a permalink and
> reference it on https://beam.apache.org/contribute/portability/  ?
>
> Let's hope that the deep red areas in the Flink runner start to change to
> friendlier colors soon as pieces are making their way back from the
> prototype branch..
>
> Thomas
>
>
> On Fri, May 11, 2018 at 12:47 PM, Henning Rohde <hero...@google.com>
> wrote:
>
>> > For runners*SDK pairs that don't have a batch/streaming distinction
>> how about collapsing the columns?
>>
>> There is also often a difference in whether we've actually tried them or
>> whether there are regression tests. Once we have a clearer (= greener and
>> bluer) picture, I'm fine with collapsing some columns. But, for now, I'd
>> like to see how it plays out.
>>
>> Henning
>>
>>
>> On Fri, May 11, 2018 at 12:16 PM Henning Rohde <hero...@google.com>
>> wrote:
>>
>>> > Yea so I guess the column is more just "what works?" and not "what
>>> works with portability?"
>>>
>>> Yeah - the Direct runner column is just "what works". It's included,
>>> because direct runners are still relevant in the portable world and it's
>>> useful to see what is supported there in comparison with the portable
>>> runners. I clarified the caption.
>>>
>>> Henning
>>>
>>> On Fri, May 11, 2018 at 12:12 PM Kenneth Knowles <k...@google.com> wrote:
>>>
>>>> On Fri, May 11, 2018 at 11:46 AM Lukasz Cwik <lc...@google.com> wrote:
>>>>
>>>>>
>>>>> On Fri, May 11, 2018 at 11:40 AM Kenneth Knowles <k...@google.com>
>>>>> wrote:
>>>>>
>>>>>> This is great. "The Beam Vision in a spreadsheet" and/or what the
>>>>>> capability matrix wishes it always had been.
>>>>>>
>>>>>>  - I don't know how to interpret the DirectRunner column. Is it that
>>>>>> it uses ye olde proto round trip? Another level is that it actually
>>>>>> directly links in the SDK harness as a dep and uses the exact code paths
>>>>>> (seems like overkill).
>>>>>>
>>>>>>
>>>>> Its up to the direct runner here to decide what level of execution is
>>>>> actually done via portability APIs but it is meant to be a single process
>>>>> to ease debugging for users.
>>>>>
>>>>
>>>> Yea so I guess the column is more just "what works?" and not "what
>>>> works with portability?" in this case. Just a clarification - either way is
>>>> fine by me. I wasn't sure if the column was to track progress on making the
>>>> direct runners respect the model or whatnot. Without a proto round trip, a
>>>> DirectRunner can easily have non-model behaviors by using information that
>>>> it shouldn't.
>>>>
>>>>  - For runners*SDK pairs that don't have a batch/streaming distinction
>>>>>> how about collapsing the columns?
>>>>>>
>>>>>>
>>>>> Runners may not have a distinction but the portability framework may
>>>>> require more work from a runner to support a use case. A good example of
>>>>> this is side input readiness checking for streaming pipelines.
>>>>>
>>>>
>>>> What do you mean the portability framework? Do you mean an SDK harness?
>>>> Or that the protos do not express enough information?
>>>>
>>>> Kenn
>>>>
>>>>
>>>>  - Anyone have spreadsheet-fu to do a permanent global automatic
>>>>>> hyperlinking of BEAM-?
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>> On Fri, May 11, 2018 at 10:38 AM Henning Rohde <hero...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>>  While the portability framework moves forward, it is often hard to
>>>>>>> figure out exactly what is supported to work at any given time. There
>>>>>>> are still many irregularities, TODOs, bugs and small differences between
>>>>>>> batch and streaming and the portable SDK and runner
>>>>>>> implementations. For example, the answer to the question "Does
>>>>>>> Wordcount run portably?" depends on the SDK, Runner and where the 
>>>>>>> output is
>>>>>>> written.
>>>>>>>
>>>>>>> To this end, I've started a spreadsheet to better track the "swiss
>>>>>>> cheese" of what works portably:
>>>>>>>
>>>>>>>
>>>>>>> https://docs.google.com/spreadsheets/d/1KDa_FGn1ShjomGd-UUDOhuh2q73de2tPz6BqHpzqvNI/edit?usp=sharing
>>>>>>>
>>>>>>> Note that this is a work in progress. The intended audience is
>>>>>>> everyone working on or interested in portability. I am hoping we can
>>>>>>> populate, expand and maintain the information as a community, until the
>>>>>>> portability framework support is mature enough to allow SDKs and 
>>>>>>> runners to
>>>>>>> be considered independently.
>>>>>>>
>>>>>>> Comments and suggestions welcome!
>>>>>>>
>>>>>>> Thanks,
>>>>>>>  Henning
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>


Re: Reproducible Environment for Jenkins Tests By Using Container

2018-05-11 Thread Henning Rohde
This is very cool! Added some comments in the doc.

Thanks,
 Henning

On Fri, May 11, 2018 at 3:26 PM Yifan Zou  wrote:

> Hello,
>
> I am working on creating a reproducible build environment for BEAM. The
> goal is to have a reproducible environment, using docker, for Beam builds
> and tests on Jenkins and on contributors' local machines.
>
> More details are in the proposal:
> https://docs.google.com/document/d/1U7FeVMiHiBP-pFm4ULotqG1QqZY0fi7g9ZwTmeIgvvM/edit?ts=5af60985#
>
> Any thoughts or comments?
>
> Thank you.
>
> Regards.
> Yifan
>
>


Re: Tracking what works with portability

2018-05-11 Thread Henning Rohde
> For runners*SDK pairs that don't have a batch/streaming distinction how
about collapsing the columns?

There is also often a difference in whether we've actually tried them or
whether there are regression tests. Once we have a clearer (= greener and
bluer) picture, I'm fine with collapsing some columns. But, for now, I'd
like to see how it plays out.

Henning


On Fri, May 11, 2018 at 12:16 PM Henning Rohde <hero...@google.com> wrote:

> > Yea so I guess the column is more just "what works?" and not "what
> works with portability?"
>
> Yeah - the Direct runner column is just "what works". It's included,
> because direct runners are still relevant in the portable world and it's
> useful to see what is supported there in comparison with the portable
> runners. I clarified the caption.
>
> Henning
>
> On Fri, May 11, 2018 at 12:12 PM Kenneth Knowles <k...@google.com> wrote:
>
>> On Fri, May 11, 2018 at 11:46 AM Lukasz Cwik <lc...@google.com> wrote:
>>
>>>
>>> On Fri, May 11, 2018 at 11:40 AM Kenneth Knowles <k...@google.com> wrote:
>>>
>>>> This is great. "The Beam Vision in a spreadsheet" and/or what the
>>>> capability matrix wishes it always had been.
>>>>
>>>>  - I don't know how to interpret the DirectRunner column. Is it that it
>>>> uses ye olde proto round trip? Another level is that it actually directly
>>>> links in the SDK harness as a dep and uses the exact code paths (seems like
>>>> overkill).
>>>>
>>>>
>>> Its up to the direct runner here to decide what level of execution is
>>> actually done via portability APIs but it is meant to be a single process
>>> to ease debugging for users.
>>>
>>
>> Yea so I guess the column is more just "what works?" and not "what works
>> with portability?" in this case. Just a clarification - either way is fine
>> by me. I wasn't sure if the column was to track progress on making the
>> direct runners respect the model or whatnot. Without a proto round trip, a
>> DirectRunner can easily have non-model behaviors by using information that
>> it shouldn't.
>>
>>  - For runners*SDK pairs that don't have a batch/streaming distinction
>>>> how about collapsing the columns?
>>>>
>>>>
>>> Runners may not have a distinction but the portability framework may
>>> require more work from a runner to support a use case. A good example of
>>> this is side input readiness checking for streaming pipelines.
>>>
>>
>> What do you mean the portability framework? Do you mean an SDK harness?
>> Or that the protos do not express enough information?
>>
>> Kenn
>>
>>
>>  - Anyone have spreadsheet-fu to do a permanent global automatic
>>>> hyperlinking of BEAM-?
>>>>
>>>> Kenn
>>>>
>>>> On Fri, May 11, 2018 at 10:38 AM Henning Rohde <hero...@google.com>
>>>> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>>  While the portability framework moves forward, it is often hard to
>>>>> figure out exactly what is supposed to work at any given time. There
>>>>> are still many irregularities, TODOs, bugs and small differences between
>>>>> batch and streaming and the portable SDK and runner implementations.
>>>>> For example, the answer to the question "Does Wordcount run
>>>>> portably?" depends on the SDK, Runner and where the output is written.
>>>>>
>>>>> To this end, I've started a spreadsheet to better track the "swiss
>>>>> cheese" of what works portably:
>>>>>
>>>>>
>>>>> https://docs.google.com/spreadsheets/d/1KDa_FGn1ShjomGd-UUDOhuh2q73de2tPz6BqHpzqvNI/edit?usp=sharing
>>>>>
>>>>> Note that this is a work in progress. The intended audience is
>>>>> everyone working on or interested in portability. I am hoping we can
>>>>> populate, expand and maintain the information as a community, until the
>>>>> portability framework support is mature enough to allow SDKs and runners 
>>>>> to
>>>>> be considered independently.
>>>>>
>>>>> Comments and suggestions welcome!
>>>>>
>>>>> Thanks,
>>>>>  Henning
>>>>>
>>>>>
>>>>>
>>>>>


Re: Tracking what works with portability

2018-05-11 Thread Henning Rohde
> Yea so I guess the column is more just "what works?" and not "what works
with portability?"

Yeah - the Direct runner column is just "what works". It's included,
because direct runners are still relevant in the portable world and it's
useful to see what is supported there in comparison with the portable
runners. I clarified the caption.

Henning

On Fri, May 11, 2018 at 12:12 PM Kenneth Knowles <k...@google.com> wrote:

> On Fri, May 11, 2018 at 11:46 AM Lukasz Cwik <lc...@google.com> wrote:
>
>>
>> On Fri, May 11, 2018 at 11:40 AM Kenneth Knowles <k...@google.com> wrote:
>>
>>> This is great. "The Beam Vision in a spreadsheet" and/or what the
>>> capability matrix wishes it always had been.
>>>
>>>  - I don't know how to interpret the DirectRunner column. Is it that it
>>> uses ye olde proto round trip? Another level is that it actually directly
>>> links in the SDK harness as a dep and uses the exact code paths (seems like
>>> overkill).
>>>
>>>
>> Its up to the direct runner here to decide what level of execution is
>> actually done via portability APIs but it is meant to be a single process
>> to ease debugging for users.
>>
>
> Yea so I guess the column is more just "what works?" and not "what works
> with portability?" in this case. Just a clarification - either way is fine
> by me. I wasn't sure if the column was to track progress on making the
> direct runners respect the model or whatnot. Without a proto round trip, a
> DirectRunner can easily have non-model behaviors by using information that
> it shouldn't.
>
>  - For runners*SDK pairs that don't have a batch/streaming distinction how
>>> about collapsing the columns?
>>>
>>>
>> Runners may not have a distinction but the portability framework may
>> require more work from a runner to support a use case. A good example of
>> this is side input readiness checking for streaming pipelines.
>>
>
> What do you mean the portability framework? Do you mean an SDK harness? Or
> that the protos do not express enough information?
>
> Kenn
>
>
>  - Anyone have spreadsheet-fu to do a permanent global automatic
>>> hyperlinking of BEAM-?
>>>
>>> Kenn
>>>
>>> On Fri, May 11, 2018 at 10:38 AM Henning Rohde <hero...@google.com>
>>> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>>  While the portability framework moves forward, it is often hard to
>>>> figure out exactly what is supposed to work at any given time. There
>>>> are still many irregularities, TODOs, bugs and small differences between
>>>> batch and streaming and the portable SDK and runner implementations.
>>>> For example, the answer to the question "Does Wordcount run portably?"
>>>> depends on the SDK, Runner and where the output is written.
>>>>
>>>> To this end, I've started a spreadsheet to better track the "swiss
>>>> cheese" of what works portably:
>>>>
>>>>
>>>> https://docs.google.com/spreadsheets/d/1KDa_FGn1ShjomGd-UUDOhuh2q73de2tPz6BqHpzqvNI/edit?usp=sharing
>>>>
>>>> Note that this is a work in progress. The intended audience is
>>>> everyone working on or interested in portability. I am hoping we can
>>>> populate, expand and maintain the information as a community, until the
>>>> portability framework support is mature enough to allow SDKs and runners to
>>>> be considered independently.
>>>>
>>>> Comments and suggestions welcome!
>>>>
>>>> Thanks,
>>>>  Henning
>>>>
>>>>
>>>>
>>>>


Tracking what works with portability

2018-05-11 Thread Henning Rohde
Hi everyone,

 While the portability framework moves forward, it is often hard to figure
out exactly what is supposed to work at any given time. There are still
many irregularities, TODOs, bugs and small differences between batch and
streaming and the portable SDK and runner implementations. For example, the
answer to the question "Does Wordcount run portably?" depends on the SDK,
Runner and where the output is written.

To this end, I've started a spreadsheet to better track the "swiss cheese"
of what works portably:


https://docs.google.com/spreadsheets/d/1KDa_FGn1ShjomGd-UUDOhuh2q73de2tPz6BqHpzqvNI/edit?usp=sharing

Note that this is a work in progress. The intended audience is everyone
working on or interested in portability. I am hoping we can populate,
expand and maintain the information as a community, until the portability
framework support is mature enough to allow SDKs and runners to be
considered independently.

Comments and suggestions welcome!

Thanks,
 Henning


Go SDK integration tests

2018-05-08 Thread Henning Rohde
Hi everyone,

 I'm currently tinkering with adding integration tests for Go (BEAM-3827)
and wrote down a small proposal to that end:

https://docs.google.com/document/d/1jy6EE7D4RjgfNV0FhD3rMsT1YKhnUfcHRZMAlC6ygXw/edit?usp=sharing

Similarly to other SDKs, the proposal is to add self-validating integration
tests that don't produce output. But unlike Java, we can't easily reuse the
Go example code directly and use a driver program with all tests linked in
to run against an arbitrary runner.
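
To make the shape concrete, a minimal self-validating test could look
roughly like the sketch below (illustrative only -- the doc is what pins
down the actual harness, naming and flags). The run fails if the assertion
does not hold on whichever runner the flags select:

package main

import (
    "context"
    "flag"
    "log"

    "github.com/apache/beam/sdks/go/pkg/beam"
    "github.com/apache/beam/sdks/go/pkg/beam/testing/passert"
    "github.com/apache/beam/sdks/go/pkg/beam/x/beamx"
)

func main() {
    flag.Parse()
    beam.Init()

    p := beam.NewPipeline()
    s := p.Root()

    // Self-validating: assert on the in-pipeline result instead of
    // writing output for an external check to pick up.
    got := beam.Create(s, "a", "b", "b")
    passert.Equals(s, got, "a", "b", "b")

    if err := beamx.Run(context.Background(), p); err != nil {
        log.Fatalf("pipeline failed: %v", err)
    }
}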

Comments welcome!

Thanks,
 Henning


Re: Graal instead of docker?

2018-05-08 Thread Henning Rohde
There are indeed lots of possibilities for interesting docker alternatives
with different tradeoffs and capabilities, but in general both the runner
and the SDK must support them for it to work. As mentioned, docker
(as used in the container contract) is meant as a flexible main option but
not necessarily the only option. I see no problem with certain
pipeline-SDK-runner combinations additionally supporting a specialized
setup. Pipeline can be a factor, because that some transforms might depend
on aspects of the runtime environment -- such as system libraries or
shelling out to a /bin/foo.

The worker boot code is tied to the current container contract, so
pre-launched workers would presumably not use that code path and would not
be bound by its assumptions. In particular, such a setup might want to
invert who initiates the connection from the SDK worker to the runner.
Pipeline options and global state in the SDK and user-function process
might make it difficult to safely reuse worker processes across pipelines,
but it is doable in certain scenarios.

Henning

On Tue, May 8, 2018 at 3:51 PM Thomas Weise  wrote:

>
>
> On Sat, May 5, 2018 at 3:58 PM, Robert Bradshaw 
> wrote:
>
>>
>> I would welcome changes to
>>
>> https://github.com/apache/beam/blob/v2.4.0/model/pipeline/src/main/proto/beam_runner_api.proto#L730
>> that would provide alternatives to docker (one of which comes to mind is
>> "I
>> already brought up a worker(s) for you (which could be the same process
>> that handled pipeline construction in testing scenarios), here's how to
>> connect to it/them.") Another option, which would seem to appeal to you in
>> particular, would be "the worker code is linked into the runner's binary,
>> use this process as the worker" (though note even for java-on-java, it can
>> be advantageous to shield the worker and runner code from each others
>> environments, dependencies, and version requirements.) This latter should
>> still likely use the FnApi to talk to itself (either over GRPC on local
>> ports, or possibly better via direct function calls eliminating the RPC
>> overhead altogether--this is how the fast local runner in Python works).
>> There may be runner environments well controlled enough that "start up the
>> workers" could be specified as "run this command line." We should make
>> this
>> environment message extensible to other alternatives than "docker
>> container
>> url," though of course we don't want the set of options to grow too large
>> or we lose the promise of portability unless every runner supports every
>> protocol.
>>
>>
> The pre-launched worker would be an interesting option, which might work
> well for a sidecar deployment.
>
> The current worker boot code though makes the assumption that the runner
> endpoint to phone home to is known when the process is launched. That
> doesn't work so well with a runner that establishes its endpoint
> dynamically. Also, the assumption is baked in that a worker will only serve
> a single pipeline (provisioning API etc.).
>
> Thanks,
> Thomas
>
>


Re: Reading all elements of a PCollection after running beam go pipeline

2018-05-04 Thread Henning Rohde
Great!

On Fri, May 4, 2018 at 4:37 PM 8 Gianfortoni <8...@tokentransit.com> wrote:

> Thanks for the workaround! That should work for me.
>
> On Fri, May 4, 2018, 1:51 PM Henning Rohde <hero...@google.com> wrote:
>
>> Hey there,
>>
>>  Until side input is fully supported, you can use GBK with a fixed key to
>> get all elements in a single bundle (assuming global windowing). That is
>> how textio.Write works internally to produce a single file currently:
>>
>> https://github.com/apache/beam/blob/90ba47114997e75a3a7daa1e8db92768d05e5432/sdks/go/pkg/beam/io/textio/textio.go#L144
>>
>>
>>
>> Thanks,
>>  Henning
>>
>>
>> On Fri, May 4, 2018 at 11:41 AM 8 Gianfortoni <8...@tokentransit.com> wrote:
>>
>>> Hi dev team,
>>>
>>> I would like to be able to read the entire results of a PCollection
>>> serially after running beam. In other frameworks this is fairly
>>> straightforward, but I don't understand how one might do this with the Beam
>>> Go SDK.
>>>
>>> I guess I can read in a file that I write, but I want to be able to read
>>> the elements in struct format. I looked through a bunch of examples but
>>> don't see any that do this.
>>>
>>> Thanks,
>>> 8
>>>
>>


Re: Graal instead of docker?

2018-05-04 Thread Henning Rohde
Romain,

Docker, unlike selinux, solves a great number of tangible problems for us
with IMO a relatively small tax. It does not have to be the only way. Some
of the concerns you bring up along with possibilities were also discussed
here: https://s.apache.org/beam-fn-api-container-contract. I encourage you
to take a look.

Thanks,
 Henning


On Fri, May 4, 2018 at 3:18 PM Romain Manni-Bucau <rmannibu...@gmail.com>
wrote:

>
>
> Le 4 mai 2018 21:31, "Henning Rohde" <hero...@google.com> a écrit :
>
> I disagree with the characterization of docker and the implications made
> towards portability. Graal looks like a neat project (and I never thought
> I would live to see the phrase "Practical Partial Evaluation" ..), but it
> doesn't address the needs of portability. In addition to Luke's examples,
> Go and most other languages don't work on it either. Docker containers also
> address packaging, OS dependencies, conflicting versions and distribution
> aspects in addition to truly universal language support.
>
>
> This is wrong, docker also has its conflicts, is not universal (fails on
> windows and mac easily - as host or not, cloud vendors put layers limiting
> or corrupting it, and it is an imposed infra constraint and a vendor
> lock-in not welcome in beam IMHO).
>
> This is my main concern. All the work done looks like an implementation
> detail of one runner+vendor corrupting the whole project and adding
> complexity and work to everyone instead of keeping it localised
> (technically it is possible).
>
> Would you accept it if I forced you to use selinux? Using docker is the same
> kind of constraint.
>
>
> That said, it's entirely fine for some runners to use Jython, Graal, etc
> to provide a specialized offering similar to the direct runners, but it
> would be disjoint from portability IMO.
>
> On Fri, May 4, 2018 at 10:14 AM Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
>>
>>
>> Le 4 mai 2018 17:55, "Lukasz Cwik" <lc...@google.com> a écrit :
>>
>> I did take a look at Graal a while back when thinking about how execution
>> environments could be defined; my concerns were related to it not
>> supporting all of the features of a language.
>> For example, it's typical for Python to load and call native libraries and
>> Graal can only execute C/C++ code that has been compiled to LLVM.
>> Also, a good amount of people interested in using ML libraries will want
>> access to GPUs to improve performance which I believe that Graal can't
>> support.
>>
>> It can be a very useful way to run simple lambda functions written in some
>> language directly without needing to use a docker environment, but you could
>> probably use something even lighter weight than Graal that is language
>> specific like Jython.
>>
>>
>>
>> Right, the jsr223 impl works very well but you can also have a perf boost
>> using native (like v8 java binding for js for instance). It is way more
>> efficient than docker most of the time and not code intrusive at all in
>> runners so likely more adoption-able and maintainable. That said all is
>> doable behind the jsr223 so maybe not a big deal in terms of api. We just
>> need to ensure the portability work stays clean and actually portable and
>> doesn't impact runners as the PoCs done until today did.
>>
>> Works for me.
>>
>>
>> On Thu, May 3, 2018 at 10:05 PM Romain Manni-Bucau <rmannibu...@gmail.com>
>> wrote:
>>
>>> Hi guys
>>>
>>> For some time there have been efforts to have language-portable support in
>>> beam, but I can't really find a case where it "works" based on docker, except
>>> for some vendor specific infra.
>>>
>>> Current solution:
>>>
>>> 1. Is runner intrusive (which is bad for beam and prevents adoption by
>>> big data vendors)
>>> 2. Based on docker (which assumes a runtime environment and is very
>>> ops/infra intrusive and likely too $$ quite often for what it brings)
>>>
>>> Did anyone had a look to graal which seems a way to make the feature
>>> doable in a lighter manner and optimized compared to default jsr223 impls?
>>>
>>>
>>
>


Re: Reading all elements of a PCollection after running beam go pipeline

2018-05-04 Thread Henning Rohde
Hey there,

 Until side input is fully supported, you can use GBK with a fixed key to
get all elements in a single bundle (assuming global windowing). That is
how textio.Write works internally to produce a single file currently:

https://github.com/apache/beam/blob/90ba47114997e75a3a7daa1e8db92768d05e5432/sdks/go/pkg/beam/io/textio/textio.go#L144
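
In outline, the pattern looks something like this sketch (same global
windowing assumption; the row type and the processing body are placeholders
for your own):

package example

import "github.com/apache/beam/sdks/go/pkg/beam"

// row stands in for whatever struct type your pipeline produces.
type row struct {
    Name string
}

// addFixedKey pairs every element with the same key, so that the
// GroupByKey below collapses the collection into a single iterable.
func addFixedKey(elm row) (int, row) {
    return 0, elm
}

// readAll is invoked once with the single key and an iterator over all
// elements, which can then be handled serially in struct form.
func readAll(key int, values func(*row) bool) {
    var elm row
    for values(&elm) {
        // handle each element here
    }
}

func collectAll(s beam.Scope, col beam.PCollection) {
    keyed := beam.ParDo(s, addFixedKey, col)
    grouped := beam.GroupByKey(s, keyed)
    beam.ParDo0(s, readAll, grouped)
}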



Thanks,
 Henning


On Fri, May 4, 2018 at 11:41 AM 8 Gianfortoni <8...@tokentransit.com> wrote:

> Hi dev team,
>
> I would like to be able to read the entire results of a PCollection
> serially after running beam. In other frameworks this is fairly
> straightforward, but I don't understand how one might do this with the Beam
> Go SDK.
>
> I guess I can read in a file that I write, but I want to be able to read
> the elements in struct format. I looked through a bunch of examples but
> don't see any that do this.
>
> Thanks,
> 8
>


Re: Graal instead of docker?

2018-05-04 Thread Henning Rohde
I disagree with the characterization of docker and the implications made
towards portability. Graal looks like a neat project (and I never thought I
would live to see the phrase "Practical Partial Evaluation" ..), but it
doesn't address the needs of portability. In addition to Luke's examples,
Go and most other languages don't work on it either. Docker containers also
address packaging, OS dependencies, conflicting versions and distribution
aspects in addition to truly universal language support.

That said, it's entirely fine for some runners to use Jython, Graal, etc to
provide a specialized offering similar to the direct runners, but it would
be disjoint from portability IMO.

On Fri, May 4, 2018 at 10:14 AM Romain Manni-Bucau 
wrote:

>
>
> Le 4 mai 2018 17:55, "Lukasz Cwik"  a écrit :
>
> I did take a look at Graal a while back when thinking about how execution
> environments could be defined; my concerns were related to it not
> supporting all of the features of a language.
> For example, it's typical for Python to load and call native libraries and
> Graal can only execute C/C++ code that has been compiled to LLVM.
> Also, a good amount of people interested in using ML libraries will want
> access to GPUs to improve performance which I believe that Graal can't
> support.
>
> It can be a very useful way to run simple lambda functions written in some
> language directly without needing to use a docker environment, but you could
> probably use something even lighter weight than Graal that is language
> specific like Jython.
>
>
>
> Right, the jsr223 impl works very well but you can also have a perf boost
> using native (like v8 java binding for js for instance). It is way more
> efficient than docker most of the time and not code intrusive at all in
> runners so likely more adoption-able and maintainable. That said all is
> doable behind the jsr223 so maybe not a big deal in terms of api. We just
> need to ensure the portability work stays clean and actually portable and
> doesn't impact runners as the PoCs done until today did.
>
> Works for me.
>
>
> On Thu, May 3, 2018 at 10:05 PM Romain Manni-Bucau 
> wrote:
>
>> Hi guys
>>
>> For some time there have been efforts to have language-portable support in
>> beam, but I can't really find a case where it "works" based on docker, except
>> for some vendor specific infra.
>>
>> Current solution:
>>
>> 1. Is runner intrusive (which is bad for beam and prevents adoption by
>> big data vendors)
>> 2. Based on docker (which assumes a runtime environment and is very
>> ops/infra intrusive and likely too $$ quite often for what it brings)
>>
>> Did anyone had a look to graal which seems a way to make the feature
>> doable in a lighter manner and optimized compared to default jsr223 impls?
>>
>>
>


Re: Gradle Status: Migrated!

2018-05-01 Thread Henning Rohde
JB - for your comparison, please also omit cross-compiling all the Go
examples because they are only built using Gradle.




On Tue, May 1, 2018 at 8:59 AM Jean-Baptiste Onofré  wrote:

> Thanks for the update Kenn, that makes sense.
>
> I'm checking the artifacts generated by Gradle right now.
>
> Regards
> JB
>
> On 01/05/2018 17:42, Kenneth Knowles wrote:
> > Raw execution time for tasks from clean is not the only thing to test. I
> > would say it is not even important. Try these from clean:
> >
> >   - Gradle: ./gradlew :beam-sdks-java-io-mongodb:test && ./gradlew
> > :beam-sdks-java-io-mongodb:test
> >   - Maven: mvn -pl sdks/java/io/mongodb test -am && mvn -pl
> > sdks/java/io/mongodb test -am
> >
> > Quick run on my laptop:
> >
> >   - Gradle: 66s (65s then 1s)
> >   - Maven: 317s (173s then 144s)
> >
> > Of course, the mvn command runs a bunch of useless executions AND it is
> > incorrect because it isn't using built jars. That's part of the point -
> > there is no way to do what you want with mvn. Let's try to make a
> > command that avoids useless work and builds the jars:
> >
> >   - Maven:  (mvn -pl sdks/java/io/mongodb install -DskipTests -am && mvn
> > -pl sdks/java/io/mongodb test) && (each time)
> >
> > That takes 102s the first time and 64s the second time. And that is
> > about the realistic workflow for someone trying to get something done.
> > Even if we touch a file Gradle finishes in 20s. So the best case for mvn
> > is this head-to-head:
> >
> >   - Gradle: 65s + 20s + 20s + 20s + 20s + ...
> >   - Maven: 102s + 64s + 64s + 64s + 64s + ...
> >
> > Kenn
> >
> >
> > On Tue, May 1, 2018 at 8:09 AM Jean-Baptiste Onofré  > > wrote:
> >
> > Thanks; for me, Maven 3.5.2 takes about the same time as Gradle
> > (using the wrapper). It's maybe related to my environment.
> >
> > Anyway, I'm doing a complete build review both in term of building
> > time,
> > and equivalence (artifacts publishing, test, plugin execution).
> >
> > I will provide an update soon.
> >
> > Regards
> > JB
> >
> > On 01/05/2018 16:57, Reuven Lax wrote:
> >  > Luke did gather data which showed that on our Jenkins executors the
> >  > Gradle build was much faster than the Maven build. Also right now we
> >  > have incremental builds turned off, but once we're confident enough to
> >  > enable them (at least for local development) that will often drop build
> >  > times a lot.
> >  >
> >  > On Tue, May 1, 2018 at 4:01 AM Jean-Baptiste Onofré
> > 
> >  > >> wrote:
> >  >
> >  > By the way, I'm curious: did someone evaluate the build time gap
> >  > between Maven and Gradle? One of the main reasons to migrate to
> >  > Gradle was the incremental build and build time. The builds I have
> >  > launched are quite the same in duration. I will do deeper tests to
> >  > evaluate the gap.
> >  >
> >  > Regards
> >  > JB
> >  >
> >  > On 05/01/2018 12:48 PM, Łukasz Gajowy wrote:
> >  >  > Hi Scott,
> >  >  >
> >  >  > thanks for the update! Just a clarification about IO
> > performance
> >  > tests: those
> >  >  > were fully migrated in Beam and all task necessary for
> running
> >  > them are there
> >  >  > but Jenkins jobs still run mvn commands. This is due the
> > fact that
> >  >  > PerfkitBenchmarker code (which is invoked by Jenkins and
> >  > constructs the commands
> >  >  > by itself) was not updated yet. This should be finished
> before
> >  > fully dropping mvn.
> >  >  >
> >  >  > More on that topic here, in
> >  >  > comments: https://issues.apache.org/jira/browse/BEAM-3942
> >  >  > PR changing the commands to gradle is waiting for PerfKit
> > devs review
> >  >  > here:
> >  >
> https://github.com/GoogleCloudPlatform/PerfKitBenchmarker/pull/1648
> >  >  >
> >  >  > Best regards,
> >  >  >
> >  >  > 2018-05-01 9:17 GMT+02:00 Romain Manni-Bucau:
> >  >  > Hi Scott
> >  >  >
> >  >  > While
> >  >  > https://issues.apache.org/jira/plugins/servlet/mobile#issue/BEAM-4057 is
> >  >  > open, gradle is a concurrent 

Re: Kafka connector for Beam Python SDK

2018-04-30 Thread Henning Rohde
Although I suspect/hope that sharing IO connectors across SDKs will
adequately cover the lion's share of implementations (especially the long
tail), I also think it's a case-by-case decision to make. Native IO might
be preferable for some uses and each SDK will want IO implementations where
they shine or at least for reference. I think of these options as
complementary.

For cross-language IO connectors that uses user functions in an intimate
way, to Reuven's point, the IO connector will have to be implemented in a
way that makes each user function a transform so that it can be supplied
and executed in the user's SDK. The current practice of embedding user
functions in DoFns won't work. This will require more fusion breaks (and
coding of data) than otherwise needed and could be a performance penalty,
unless the IO connector can be written in a way that avoids the user
function as Kenn suggests.

Small +1 to Kenn's idea of auditing the existing IO connectors to get a
sense of which IO might be problematic. However, it might be a tad
premature to do too much until the cross-language transform feature is
fleshed out further.

Henning


On Mon, Apr 30, 2018 at 9:54 AM Kenneth Knowles  wrote:

> I agree with Cham's motivations as far as "we need it now" and getting
> Python SDF up and running and exercised on a real connector.
>
> But I do find the current API of BigQueryIO to be a poor example. That
> particular functionality on BigQueryIO seems extraneous and goes against
> our own style guide [1]. The recommended way to write it would be for
> BigQueryIO to output a natural concrete type (like TableRow) and allow the
> following step to do conversions. This is a broader programming best
> practice - unless there is a compelling reason, you should just return the
> value rather than accept a higher-order function to operate on the value.
> Is there a compelling reason in this case? I just dug through the code and
> just see that it bottoms out in AvroSource where it does not seem to add
> functionality.
>
> Considering cross-language pipelines as a primary use case for all
> connectors, perhaps we should audit them and bring them into alignment now,
> deprecating paths using higher-order functions. We can still consider
> host-language convenience composites.
>
> For an unbounded source like KafkaIO the compelling reason is the
> timestamp extracting function to be able to maintain a watermark. Notably,
> PubsubIO does not accept such a function, but requires the timestamp to be
> in a metadata field that any language can describe (versus having to parse
> the message to pull out the timestamp).
>
> Kenn
>
> [1]
> https://beam.apache.org/contribute/ptransform-style-guide/#choosing-types-of-input-and-output-pcollections
>
> On Mon, Apr 30, 2018 at 9:27 AM Reuven Lax  wrote:
>
>> Another point: cross-language IOs might add a performance penalty in many
>> cases. For an example of this look at BigQueryIO. The user can register a
>> SerializableFunction that is evaluated on every record, and determines
>> which destination to write the record to. Now a Python user would want to
>> register a Python function for this of course. this means that the Java IO
>> would have to invoke Python code for each record it sees, which will likely
>> be a big performance hit.
>>
>> Of course the downside of duplicating IOs is exactly as you say -
>> multiple versions to maintain, and potentially duplicate bugs. I think the
>> right answer will need to be on a case-by-case basis.
>>
>> Reuven
>>
>> On Mon, Apr 30, 2018 at 8:05 AM Chamikara Jayalath 
>> wrote:
>>
>>> Hi Aljoscha,
>>>
>>> I tried to cover this in the doc. Once we have full support for
>>> cross-language IO, we can decide this on a case-by-case basis. But I don't
>>> think we should cease defining new sources/sinks for Beam Python SDK till
>>> we get to that point. I think there are good reasons for adding Kafka
>>> support for Python today and many Beam users have requested this. Also, note
>>> that the proposed Python Kafka source will be based on the Splittable DoFn
>>> framework while the current Java version is based on the UnboundedSource
>>> framework. Here are the reasons that are currently listed in the doc.
>>>
>>>
>>>-
>>>
>>>Users might find it useful to have at least one unbounded source and
>>>sink combination implemented in Python SDK and Kafka is the streaming
>>>system that makes the most sense to support if we just want to add support 
>>> for
>>>only one such system in Python SDK.
>>>-
>>>
>>>Not all runners might support cross-language IO. Also some
>>>user/runner/deployment combinations might require an unbounded 
>>> source/sink
>>>implemented in Python SDK.
>>>-
>>>
>>>We recently added Splittable DoFn support to Python SDK. It will be
>>>good to have at least one production quality Splittable DoFn that
>>>will serve as a good example for any 

Re: Custom URNs and runner translation

2018-04-24 Thread Henning Rohde
> Note that a KafkaDoFn still needs to be provided, but could be a DoFn that
> fails loudly if it's actually called in the short term rather than a full
> Python implementation.

For configurable runner-native IO, for now, I think it is reasonable to use
a URN + special data payload directly without a KafkaDoFn -- assuming it's
a portable pipeline. That's what we do in Go for PubSub-on-Dataflow and
something similar would work for Kafka-on-Flink as well. I agree that a
non-native alternative implementation is desirable, but if one is not
present we should IMO rather fail at job submission instead of at runtime.
I could imagine connectors intrinsic to an execution engine where
non-native implementations are not possible.
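
To make the shape concrete, here is a rough Go sketch of the URN + payload
approach (the URN, the payload layout, and the beam.External call below are
illustrative assumptions for this sketch, not the actual PubSub-on-Dataflow
hook):

  // Declare a runner-native read by URN with an opaque payload and no
  // SDK-side implementation; a runner that does not recognize the URN
  // should reject the job at submission time.
  payload, _ := json.Marshal(map[string]interface{}{
      "bootstrap_servers": "kafka:9092",
      "topic":             "events",
  })
  out := beam.External(s, "example:flink:kafka:read:v1", payload,
      nil, []beam.FullType{typex.New(reflectx.ByteSlice)}, false)
  raw := out[0] // a PCollection of raw []byte records to decode downstream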


On Tue, Apr 24, 2018 at 3:09 PM Robert Bradshaw  wrote:

> On Tue, Apr 24, 2018 at 1:14 PM Thomas Weise  wrote:
>
> > Hi Cham,
>
> > Thanks for the feedback!
>
> > I should have probably clarified that my POC and questions aren't
> specific to Kafka as a source, but apply to pretty much any other source/sink
> that we use internally as well. We have existing Flink pipelines that are written
> in Java and we want to use the same connectors with the Python SDK on top
> of the already operationalized Flink stack. Therefore, portability isn't a
> concern as much as the ability to integrate is.
>
> > -->
>
> > On Tue, Apr 24, 2018 at 12:00 PM, Chamikara Jayalath
> > 
> wrote:
>
> >> Hi Thomas,
>
> >> Seems like we are working on similar (partially) things :).
>
> >> On Tue, Apr 24, 2018 at 9:03 AM Thomas Weise  wrote:
>
> >>> I'm working on a mini POC to enable Kafka as custom streaming source
> for a Python pipeline executing on the (in-progress) portable Flink runner.
>
> >>> We eventually want to use the same native Flink connectors for sources
> and sinks that we also use in other Flink jobs.
>
>
> >> Could you clarify what you mean by the same Flink connector? Do you mean
> that Beam-based and non-Beam-based versions of Flink will use the same
> Kafka connector implementation?
>
>
> > The native Flink sources as shown in the example below, not the Beam
> KafkaIO or other Beam sources.
>
>
>
> >>> I got a simple example to work with the FlinkKafkaConsumer010 reading
> from Kafka and a Python lambda logging the value. The code is here:
>
>
>
> https://github.com/tweise/beam/commit/79b682eb4b83f5b9e80f295464ebf3499edb1df9
>
>
>
> >>> I'm looking for feedback/opinions on the following items in particular:
>
> >>> * Enabling custom translation on the Flink portable runner (custom
> translator could be loaded with ServiceLoader, additional translations
> could also be specified as job server configuration, pipeline option, ...)
>
> >>> * For the Python side, is what's shown in the commit the recommended
> way to define a custom transform (it would eventually live in a reusable
> custom module that pipeline authors can import)? Also, the example does not
> have the configuration part covered yet..
>
>
> >> The only standard unbounded source API offered by Python SDK is the
> Splittable DoFn API. This is the part I'm working on. I'm trying to add a
> Kafka connector for Beam Python SDK using SDF API. JIRA is
> https://issues.apache.org/jira/browse/BEAM-3788. I'm currently comparing
> different Kafka Python client libraries. Will share more information on
> this soon.
>
> >> I understand this might not be possible in all cases and we might want
> to consider adding a native source/sink implementations. But this will
> result in the implementation being runner-specific (each runner will have
> to have it's own source/sink implementation). So I think we should try to
> add connector implementations to Beam using the standard API whenever
> possible. We also have plans to implement support for cross SDK transforms
> in the future (so that we can utilize Java implementation from Python for
> example) but we are not there yet and we might still want to implement a
> connector for a given SDK if there's good client library support.
>
>
> > It is great that the Python SDK will have connectors that are written in
> Python in the future, but I think it is equally if not more important to be
> able to use at least the Java Beam connectors with Python SDK (and any
> other non-Java SDK). Especially in a fully managed environment it should be
> possible to offer this to users in a way that is largely transparent. It
> takes significant time and effort to mature connectors and I'm not sure it
> is realistic to repeat that for all external systems in multiple languages.
> Or, to put it in another way, it is likely that instead of one over time
> rock solid connector per external system there will be multiple less mature
> implementations. That's also the reason we internally want to use the Flink
> native connectors - we know what they can and cannot do and want to
> leverage the existing investment.
>
> There are two related issues here: how to 

Re: Go SDK build failures with maven

2018-04-20 Thread Henning Rohde
Great!

On Fri, Apr 20, 2018 at 6:28 AM Colm O hEigeartaigh <cohei...@apache.org>
wrote:

> Thanks for the quick turnaround! Both builds now working correctly for me.
>
> Colm.
>
> On Fri, Apr 20, 2018 at 12:50 AM, Henning Rohde <hero...@google.com>
> wrote:
>
>> Hi Colm,
>>
>>   The extra pubsub dependency broke the Maven build. The warning you see
>> in gradle seems to be a (non-breaking) linter check on the format string,
>> which is interesting given that the function called is an internal one --
>> but the check happens to be correct.
>>
>>  Sent you https://github.com/apache/beam/pull/5190 for both issues.
>> Sorry about the noise.
>>
>> Thanks,
>>  Henning
>>
>> On Thu, Apr 19, 2018 at 10:01 AM Henning Rohde <hero...@google.com>
>> wrote:
>>
>>> The Go build works best with Gradle and that's what you should use
>>> (other than possibly manually running go build etc). It looks like the
>>> build might be broken due to the Go streaming PR independently of the build
>>> tooling. Let me take a look.
>>>
>>> On Thu, Apr 19, 2018 at 7:09 AM Colm O hEigeartaigh <cohei...@apache.org>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Is the Apache Maven build still maintained? I'm seeing some recent
>>>> failures in the Go SDK:
>>>>
>>>> [INFO] --- mvn-golang-wrapper:2.1.7:test (go-test) @ beam-sdks-go ---
>>>> [INFO] Prepared command line : bin/go test ./...
>>>> [ERROR]
>>>> [ERROR] -Exec.Err-
>>>> [ERROR] /home/colm/.mvnGoLang/.go_path/src/
>>>> cloud.google.com/go/pubsub/subscription.go:30:2: cannot find package "
>>>> golang.org/x/sync/errgroup" in any of:
>>>> [ERROR] /home/colm/.mvnGoLang/go1.9.linux-amd64/src/
>>>> golang.org/x/sync/errgroup (from $GOROOT)
>>>> [ERROR] /home/colm/.mvnGoLang/.go_path/src/
>>>> golang.org/x/sync/errgroup (from $GOPATH)
>>>> [ERROR] /home/colm/src/apache/beam/sdks/go/target/src/
>>>> golang.org/x/sync/errgroup
>>>> [ERROR] /home/colm/.mvnGoLang/.go_path/src/
>>>> cloud.google.com/go/pubsub/flow_controller.go:19:2: cannot find
>>>> package "golang.org/x/sync/semaphore" in any of:
>>>> [ERROR] /home/colm/.mvnGoLang/go1.9.linux-amd64/src/
>>>> golang.org/x/sync/semaphore (from $GOROOT)
>>>> [ERROR] /home/colm/.mvnGoLang/.go_path/src/
>>>> golang.org/x/sync/semaphore (from $GOPATH)
>>>> [ERROR] /home/colm/src/apache/beam/sdks/go/target/src/
>>>> golang.org/x/sync/semaphore
>>>> [ERROR]
>>>>
>>>> Incidentally, when run via gradle I see:
>>>>
>>>> /home/colm/src/apache/beam/sdks/go/pkg/beam/util/pubsubx/pubsub.go:75:
>>>> wrong number of args for format in Errorf call: 1 needed but 2 args
>>>> /home/colm/src/apache/beam/sdks/go/pkg/beam/util/pubsubx/pubsub.go:78:
>>>> wrong number of args for format in Errorf call: 1 needed but 2 args
>>>> /home/colm/src/apache/beam/sdks/go/target/src/
>>>> github.com/apache/beam/sdks/go/pkg/beam/util/pubsubx/pubsub.go:75:
>>>> wrong number of args for format in Errorf call: 1 needed but 2 args
>>>> /home/colm/src/apache/beam/sdks/go/target/src/
>>>> github.com/apache/beam/sdks/go/pkg/beam/util/pubsubx/pubsub.go:78:
>>>> wrong number of args for format in Errorf call: 1 needed but 2 args
>>>>
>>>>
>>>> Colm.
>>>>
>>>>
>>>> --
>>>> Colm O hEigeartaigh
>>>>
>>>> Talend Community Coder
>>>> http://coders.talend.com
>>>>
>>>
>
>
> --
> Colm O hEigeartaigh
>
> Talend Community Coder
> http://coders.talend.com
>


Re: The Go SDK got accidentally merged - options to deal with the pain

2018-04-19 Thread Henning Rohde
Hi everyone,

 Thank you all for your patience. The last major identified feature (Go
windowing) is now in review: https://github.com/apache/beam/pull/5179. The
remaining work listed under

 https://issues.apache.org/jira/browse/BEAM-2083

is integration tests and documentation (quickstart, etc). I expect that
will take a few weeks after which we should be in a position to do a vote
about making the Go SDK an official Beam SDK. To this end, please do take a
look at the listed tasks and let me know if there is anything missing.

Lastly, I have a practical question: how should we order the PRs to the
beam site documentation wrt the vote? Should we get PRs accepted, but not
committed before a vote? Or just commit them as they are ready to avoid
potential merge conflicts?

Thanks!

Henning




On Sat, Mar 10, 2018 at 10:45 AM Henning Rohde <hero...@google.com> wrote:

> Thank you all! I've added the remaining work -- as I understand it -- as
> dependencies to the overall Go SDK issue (tracking the "official" merge to
> master):
>
> https://issues.apache.org/jira/browse/BEAM-2083
>
> Please feel free to add to this list or expand the items, if there is
> anything I overlooked. If the presence of the Go SDK in master causes
> issues for other modules, please simply file a bug against me and I'll take
> care of it.
>
> Robert - I understand your last reply as addressing Davor's points. Please
> let me know if there is anything I need to do in that regard.
>
> Henning
>
>
>
> On Fri, Mar 9, 2018 at 8:39 AM, Ismaël Mejía <ieme...@gmail.com> wrote:
>
>> +1 to let it evolve in master (+Davor points), having ongoing work on
>> master makes sense given the state of advance + the hope that this
>> won't add any issue for the other modules.
>>
>> On Thu, Mar 8, 2018 at 7:30 PM, Robert Bradshaw <rober...@google.com>
>> wrote:
>> > +1 to both of these points. SGA should have probably already been
>> filed, and
>> > excising this from releases should be easy, but I added a line item to
>> the
>> > validation checklist template to make sure we don't forget.
>> >
>> > On Thu, Mar 8, 2018 at 7:13 AM Davor Bonaci <da...@apache.org> wrote:
>> >>
>> >> I support leaving things as they stand now -- thanks for finding a good
>> >> way out of an uncomfortable situation.
>> >>
>> >> That said, two things need to happen:
>> >> (1) SGA needs to be filed asap, per Board feedback in the last report,
>> and
>> >> (2) releases cannot contain any code from the Go SDK before formally
>> voted
>> >> on the new component and accepted. This includes source releases that
>> are
>> >> created through "assembly", so manual exclusion in the configuration is
>> >> likely needed.
>> >>
>> >> On Wed, Mar 7, 2018 at 1:54 PM, Kenneth Knowles <k...@google.com>
>> wrote:
>> >>>
>> >>> Re-reading the old thread, I see these desiderata:
>> >>>
>> >>>  - "enough IO to write end-to-end examples such as WordCount and
>> >>> demonstrate what IOs would look like"
>> >>>  - "accounting and tracking the fact that each element has an
>> associated
>> >>> window and timestamp"
>> >>>  - "test suites and test utilities"
>> >>>
>> >>> Browsing the code, it looks like these each exist to some level of
>> >>> completion.
>> >>>
>> >>> Kenn
>> >>>
>> >>>
>> >>> On Wed, Mar 7, 2018 at 1:38 PM Robert Bradshaw <rober...@google.com>
>> >>> wrote:
>> >>>>
>> >>>> I was actually thinking along the same lines: what was yet lacking to
>> >>>> "officially" merge the Go branch in? The thread we started on this
>> seems to
>> >>>> have fizzled out over the holidays, but windowing support is the only
>> >>>> must-have missing technical feature in my book (assuming
>> documentation and
>> >>>> testing are, or are brought up to snuff).
>> >>>>
>> >>>>
>> >>>> On Wed, Mar 7, 2018 at 1:35 PM Henning Rohde <hero...@google.com>
>> wrote:
>> >>>>>
>> >>>>> One thought: the Go SDK is actually not that far away from
>> satisfying
>> >>>>> the guidelines for merging to master anyway (as discussed here
>> [1]). If we
>> >>>>> decide to simply 

Re: Go SDK build failures with maven

2018-04-19 Thread Henning Rohde
Hi Colm,

  The extra pubsub dependency broke the Maven build. The warning you see in
gradle seems to be a (non-breaking) linter check on the format string,
which is interesting given that the function called is an internal one --
but the check happens to be correct.
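
For illustration, the class of mistake that check catches looks like this
(a made-up helper, not the actual pubsub.go code):

  func describeFailure(topic string, err error) error {
      // What the linter flags: one %v verb but two operands, e.g.
      //   fmt.Errorf("creating topic %v failed", topic, err)
      // The fix is one verb per operand:
      return fmt.Errorf("creating topic %v failed: %v", topic, err)
  }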

 Sent you https://github.com/apache/beam/pull/5190 for both issues. Sorry
about the noise.

Thanks,
 Henning

On Thu, Apr 19, 2018 at 10:01 AM Henning Rohde <hero...@google.com> wrote:

> The Go build works best with Gradle and that's what you should use (other
> than possibly manually running go build etc). It looks like the build might
> be broken due to the Go streaming PR independently of the build tooling.
> Let me take a look.
>
> On Thu, Apr 19, 2018 at 7:09 AM Colm O hEigeartaigh <cohei...@apache.org>
> wrote:
>
>> Hi all,
>>
>> Is the Apache Maven build still maintained? I'm seeing some recent
>> failures in the Go SDK:
>>
>> [INFO] --- mvn-golang-wrapper:2.1.7:test (go-test) @ beam-sdks-go ---
>> [INFO] Prepared command line : bin/go test ./...
>> [ERROR]
>> [ERROR] -Exec.Err-
>> [ERROR] /home/colm/.mvnGoLang/.go_path/src/
>> cloud.google.com/go/pubsub/subscription.go:30:2: cannot find package "
>> golang.org/x/sync/errgroup" in any of:
>> [ERROR] /home/colm/.mvnGoLang/go1.9.linux-amd64/src/
>> golang.org/x/sync/errgroup (from $GOROOT)
>> [ERROR] /home/colm/.mvnGoLang/.go_path/src/golang.org/x/sync/errgroup
>> (from $GOPATH)
>> [ERROR] /home/colm/src/apache/beam/sdks/go/target/src/
>> golang.org/x/sync/errgroup
>> [ERROR] /home/colm/.mvnGoLang/.go_path/src/
>> cloud.google.com/go/pubsub/flow_controller.go:19:2: cannot find package "
>> golang.org/x/sync/semaphore" in any of:
>> [ERROR] /home/colm/.mvnGoLang/go1.9.linux-amd64/src/
>> golang.org/x/sync/semaphore (from $GOROOT)
>> [ERROR] /home/colm/.mvnGoLang/.go_path/src/
>> golang.org/x/sync/semaphore (from $GOPATH)
>> [ERROR] /home/colm/src/apache/beam/sdks/go/target/src/
>> golang.org/x/sync/semaphore
>> [ERROR]
>>
>> Incidentally, when run via gradle I see:
>>
>> /home/colm/src/apache/beam/sdks/go/pkg/beam/util/pubsubx/pubsub.go:75:
>> wrong number of args for format in Errorf call: 1 needed but 2 args
>> /home/colm/src/apache/beam/sdks/go/pkg/beam/util/pubsubx/pubsub.go:78:
>> wrong number of args for format in Errorf call: 1 needed but 2 args
>> /home/colm/src/apache/beam/sdks/go/target/src/
>> github.com/apache/beam/sdks/go/pkg/beam/util/pubsubx/pubsub.go:75: wrong
>> number of args for format in Errorf call: 1 needed but 2 args
>> /home/colm/src/apache/beam/sdks/go/target/src/
>> github.com/apache/beam/sdks/go/pkg/beam/util/pubsubx/pubsub.go:78: wrong
>> number of args for format in Errorf call: 1 needed but 2 args
>>
>>
>> Colm.
>>
>>
>> --
>> Colm O hEigeartaigh
>>
>> Talend Community Coder
>> http://coders.talend.com
>>
>


Re: [Go SDK] Proposal: Set up a Vanity Import Path

2018-04-19 Thread Henning Rohde
+1 Great proposal!

On Wed, Apr 18, 2018 at 1:37 PM Ismaël Mejía  wrote:

> Well it is not really a rule for a proposal but notice that there are
> people in different time zones or people that for different reasons
> cannot answer immediately, so a longer period could give them a chance
> to voice their opinions.
>
> On Wed, Apr 18, 2018 at 10:23 PM, Robert Burke  wrote:
> > That's good to know!
> > I had heard of that specific rule, but I didn't realize it pertained to
> > filing of a JIRA issue (when related to a proposal) as well.
> > Thank you.
> >
> > On Wed, 18 Apr 2018 at 13:08 Ismaël Mejía  wrote:
> >>
> >> +1 Nice idea and proposal.
> >>
> >> This was not a vote thread, but for the future it is a good idea to
> >> leave a bigger time window before reaching consensus.
> >> Notice that a formal vote leaves at least 72h for participants to voice
> >> their opinion before concluding something.
> >>
> >> https://www.apache.org/foundation/voting.html
> >>
> >>
> >>
> >>
> >> On Wed, Apr 18, 2018 at 6:29 PM, Robert Burke 
> wrote:
> >> > This seems like enough consensus to file the JIRA, so
> >> > https://issues.apache.org/jira/browse/BEAM-4115 has now been created.
> >> >
> >> > I'll get to work on the PRs shortly.
> >> >
> >> > Cheers,
> >> > Robert Burke
> >> >
> >> > On Wed, 18 Apr 2018 at 03:52 Jean-Baptiste Onofré 
> >> > wrote:
> >> >>
> >> >> +1
> >> >>
> >> >> Agree
> >> >> Regards
> >> >> JB
> >> >> Le 18 avr. 2018, à 14:51, Aljoscha Krettek  a
> >> >> écrit:
> >> >>>
> >> >>> +1 this sounds super reasonable
> >> >>>
> >> >>>
> >> >>> On 17. Apr 2018, at 20:11, Kenneth Knowles  wrote:
> >> >>>
> >> >>> This seems like a valuable layer of indirection to establish. The
> >> >>> mechanisms are pretty esoteric, but I trust Gophers to know the best
> >> >>> way to
> >> >>> do it. Commented just a smidgin on the doc.
> >> >>>
> >> >>> Kenn
> >> >>>
> >> >>> On Mon, Apr 16, 2018 at 4:57 PM Robert Burke 
> >> >>> wrote:
> >> 
> >>  Hi All!
> >>  While the Go SDK is still experimental, that doesn't mean it
> >>  shouldn't
> >>  be future proofed.
> >> 
> >>  Go has the ability to specify custom import paths for a prefix of
> >>  packages. This has the benefit of avoiding generic GitHub paths, and
> >>  avoids breaking users in the event of infrastructure changes such as
> >>  moving off of GitHub, or even splitting the repo into per-language
> >>  components.
> >> 
> >>  Currently users need to import paths like:
> >> 
> >>  import "github.com/apache/beam/sdks/go/pkg/beam/io/textio"
> >> 
> >>  to get at SDK packages. If we implement this proposal, they would
> >>  look
> >>  like:
> >> 
> >>  import "beam.apache.org/sdks/go/pkg/beam/io/textio"
> >> 
> >>  which are a bit shorter, a bit more stable, and a bit nicer, with
> the
> >>  benefits outlined above.
> >> 
> >>  I wrote a doc with details which is at
> >>  https://s.apache.org/go-beam-vanity-import
> >>  (Thank you Thomas for short-linking it for me.)
> >> 
> >>  The doc should answer most of your questions, but please let me
> know
> >>  if
> >>  you have others either here, or in a doc comment.
> >> 
> >>  If there's consensus to do so, it would be better if it's done sooner
> >>  rather than after folks begin depending on it. We wouldn't want to
> >>  have
> >>  fragmented examples.
> >> 
> >>  Robert Burke
> >>  (One of the Gopher Googlers who have been quietly lurking on the
> >>  list,
> >>  and submitting the occasional PR for the Go SDK. I look forward to
> >>  working
> >>  with you all!)
> >> >>>
> >> >>>
> >> >
>


Re: Go SDK build failures with maven

2018-04-19 Thread Henning Rohde
The Go build works best with Gradle and that's what you should use (other
than possibly manually running go build etc). It looks like the build might
be broken due to the Go streaming PR independently of the build tooling.
Let me take a look.

On Thu, Apr 19, 2018 at 7:09 AM Colm O hEigeartaigh 
wrote:

> Hi all,
>
> Is the Apache Maven build still maintained? I'm seeing some recent
> failures in the Go SDK:
>
> [INFO] --- mvn-golang-wrapper:2.1.7:test (go-test) @ beam-sdks-go ---
> [INFO] Prepared command line : bin/go test ./...
> [ERROR]
> [ERROR] -Exec.Err-
> [ERROR] /home/colm/.mvnGoLang/.go_path/src/
> cloud.google.com/go/pubsub/subscription.go:30:2: cannot find package "
> golang.org/x/sync/errgroup" in any of:
> [ERROR] /home/colm/.mvnGoLang/go1.9.linux-amd64/src/
> golang.org/x/sync/errgroup (from $GOROOT)
> [ERROR] /home/colm/.mvnGoLang/.go_path/src/golang.org/x/sync/errgroup
> (from $GOPATH)
> [ERROR] /home/colm/src/apache/beam/sdks/go/target/src/
> golang.org/x/sync/errgroup
> [ERROR] /home/colm/.mvnGoLang/.go_path/src/
> cloud.google.com/go/pubsub/flow_controller.go:19:2: cannot find package "
> golang.org/x/sync/semaphore" in any of:
> [ERROR] /home/colm/.mvnGoLang/go1.9.linux-amd64/src/
> golang.org/x/sync/semaphore (from $GOROOT)
> [ERROR] /home/colm/.mvnGoLang/.go_path/src/golang.org/x/sync/semaphore
> (from $GOPATH)
> [ERROR] /home/colm/src/apache/beam/sdks/go/target/src/
> golang.org/x/sync/semaphore
> [ERROR]
>
> Incidentally, when run via gradle I see:
>
> /home/colm/src/apache/beam/sdks/go/pkg/beam/util/pubsubx/pubsub.go:75:
> wrong number of args for format in Errorf call: 1 needed but 2 args
> /home/colm/src/apache/beam/sdks/go/pkg/beam/util/pubsubx/pubsub.go:78:
> wrong number of args for format in Errorf call: 1 needed but 2 args
> /home/colm/src/apache/beam/sdks/go/target/src/
> github.com/apache/beam/sdks/go/pkg/beam/util/pubsubx/pubsub.go:75: wrong
> number of args for format in Errorf call: 1 needed but 2 args
> /home/colm/src/apache/beam/sdks/go/target/src/
> github.com/apache/beam/sdks/go/pkg/beam/util/pubsubx/pubsub.go:78: wrong
> number of args for format in Errorf call: 1 needed but 2 args
>
>
> Colm.
>
>
> --
> Colm O hEigeartaigh
>
> Talend Community Coder
> http://coders.talend.com
>


Re: Add a (temporary) Portable Flink branch to the ASF repo?

2018-04-12 Thread Henning Rohde
+ 1 to capture in JIRA what needs to be done.

The simplest path forward might be to reimplement/cherrypick'n'modify the
changes onto master directly. We would then effectively just abandon the
hacking branch and treat code there as a prototype. Although we would add
components without end2end verification initially, it would allow parallel
progress across the SDKs and the Flink runner. The Go SDK is also already
in master and can help test the migration of the Flink runner changes
before the other SDKs are ready.

On Thu, Apr 12, 2018 at 4:44 PM Thomas Weise  wrote:

> Strong +1 on transitioning all development to the ASF repo.
>
> I think a straight move of the hacking branch may still be problematic
> though, because it sets the path to continue hacking vs. working towards a
> viable milestone that other contributors can base their work off. I would
> prefer a state that separates serious development from hacks in a way where
> the code does not overlap.
>
> Based on Ben's assessment, if most of the hacks are currently in the
> SDK area, then perhaps we can transition everything related to the job server
> and translation to master so that it is possible to build and work on the
> runner there and then only use the hacked SDKs branch for demos?
>
> And maybe discuss an MVP milestone and put together a JIRA view that shows
> what remains to be done to get there?
>
> Thanks,
> Thomas
>
>
> On Thu, Apr 12, 2018 at 4:26 PM, Holden Karau 
> wrote:
>
>> So I would be strongly in favour of adding it as a branch on the Apache
>> repo. This way other folks are more likely to be able to help with the
>> splitting up and merging process, and also, now that Flink Forward is
>> behind us, getting in the practice of doing feature branches on the ASF repo for
>> collaboration instead of personal github accounts seems like a worthy goal.
>>
>> On Thu, Apr 12, 2018 at 4:21 PM Robert Bradshaw 
>> wrote:
>>
>>> I suppose with the hackathon and Flink Forward behind us, I'm thinking we
>>> should start shifting gears more toward getting what we have into master
>>> in a production state and less toward continuing to work on a hacking
>>> branch. If we think it'll be fairly quick, there's no big need to create an official
>>> branch,
>>> and if it's going to be long lived perhaps we should rethink our process.
>>> On Thu, Apr 12, 2018 at 3:44 PM Aljoscha Krettek 
>>> wrote:
>>>
>>> > I would also be in favour of adding a branch to our main repo. A random
>>> branch on some personal GitHub account can seem a bit sketchy and adding
>>> a
>>> branch to our repo could make it more visible for people that are
>>> interested.
>>>
>>>
>>>
>>> > On 12. Apr 2018, at 15:29, Ben Sidhom  wrote:
>>>
>>> > I would say that most of it is not suitable for direct merging. There
>>> are
>>> several reasons for this:
>>>
>>> > Most changes are built on upstream PRs that are either not submitted or
>>> have been rebased before submission.
>>> > There are some very hacky changes in the Python and Java SDKs to get
>>> portable pipelines working. For example, hard coding certain options
>>> and/or
>>> baking dependencies into the SDK harness images. These need to be
>>> actually
>>> implemented correctly in their respective SDKs.
>>> > Much of the code does not have proper tests and fails simple lint
>>> tests.
>>>
>>> > As a concrete example, I tried cherry-picking the changes from
>>> https://github.com/bsidhom/beam/pull/46 into master. This is a
>>> relatively
>>> simple change, but there were so many merge conflicts that in the end it
>>> was easier to just reimplement the changes atop master. More importantly,
>>> most changes will require refactoring before actually going in.
>>>
>>> > On Thu, Apr 12, 2018 at 3:16 PM, Robert Bradshaw 
>>> wrote:
>>>
>>> >> How much of this is not suitable to merging into master directly (not
>>> as
>>> >> is, but as separate PRs)?
>>> >> On Thu, Apr 12, 2018 at 3:10 PM Ben Sidhom  wrote:
>>>
>>> >> > Hey all,
>>>
>>> >> > I've been working on a proof-of-concept portable Flink runner with
>>> some
>>> >> other Beam contributors. We would like to have a point of reference
>>> for
>>> the
>>> >> rest of the Beam community as we integrate this work into master. It
>>> >> currently lives under
>>> >> https://github.com/bsidhom/beam/tree/hacking-job-server.
>>>
>>> >> > I would suggest pulling this into the main ASF repo under an
>>> >> appropriately-named branch (flink-portable-hacking?). The name should
>>> >> suggest the intention that this branch is not intended to be pulled
>>> into
>>> >> master as-is and that it should rather be used as a reference for now.
>>>
>>> >> > Thoughts?
>>>
>>> >> > --
>>> >> > -Ben
>>>
>>>
>>>
>>>
>>> > --
>>> > -Ben
>>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>


Re: Golang Beam SDK GroupByKey not working when running locally

2018-03-30 Thread Henning Rohde
Hi 8,

 This is a bug in the Go SDK regarding direct output after GBK. As a
workaround, if you change this signature

func(word string, values func(*int) bool) (string, int)

to

func(word string, values func(*int) bool, emit func(string, int))

and emit the result instead of returning it, it works. Opened
https://issues.apache.org/jira/browse/BEAM-3978.
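
For concreteness, a minimal sketch of the workaround applied to a per-key
sum (assuming the same scope s and grouped PCollection as in the report
quoted below; illustrative, not a final API):

  counted := beam.ParDo(s, func(word string, values func(*int) bool, emit func(string, int)) {
      sum := 0
      var i int
      for values(&i) {
          sum += i
      }
      emit(word, sum) // emitted rather than returned, sidestepping the bug
  }, grouped)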

Thanks,
 Henning

PS: Btw, the minimal_wordcount doesn't log the direct.Execute error (among
other things) and is there mainly to mimic the progression in Java. It's
not a good model for real pipelines.
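
As a reference point, the minimal way to surface the error in user code is
to check the returned error explicitly (a sketch, assuming the beamx entry
point and an already-constructed pipeline p):

  if err := beamx.Run(context.Background(), p); err != nil {
      log.Fatalf("pipeline failed: %v", err)
  }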



On Fri, Mar 30, 2018 at 5:21 PM 8 Gianfortoni <8...@tokentransit.com> wrote:

> Oh, I forgot to mention that I pulled from master with this commit as
> latest:
> https://github.com/apache/beam/commit/95a524e52606de1467b5d8b2cc99263b8a111a8d
>
>
>
> On Fri, Mar 30, 2018, 5:09 PM 8 Gianfortoni <8...@tokentransit.com> wrote:
>
>> Fix cc to correct Holden.
>>
>> On Fri, Mar 30, 2018 at 5:05 PM, 8 Gianfortoni <8...@tokentransit.com>
>> wrote:
>>
>>> Hi dev team,
>>>
>>> I'm having a lot of trouble running any pipeline that calls GroupByKey.
>>> Maybe I'm doing something wrong, but for some reason I cannot get
>>> GroupByKey not to crash the program.
>>>
>>> I have edited wordcount.go and minimal_wordcount.go to work similarly
>>> to my own program, and it crashes for those as well.
>>>
>>> Here is the snippet of code I added to minimal_wordcount (full source
>>> attached):
>>>
>>> // Concept #3: Invoke the stats.Count transform on our PCollection of
>>> // individual words. The Count transform returns a new PCollection of
>>> // key/value pairs, where each key represents a unique word in the text.
>>> // The associated value is the occurrence count for that word.
>>> singles := beam.ParDo(s, func(word string) (string, int) {
>>>     return word, 1
>>> }, words)
>>>
>>> grouped := beam.GroupByKey(s, singles)
>>>
>>> counted := beam.ParDo(s, func(word string, values func(*int) bool) (string, int) {
>>>     sum := 0
>>>     for {
>>>         var i int
>>>         if values(&i) {
>>>             sum = sum + i
>>>         } else {
>>>             break
>>>         }
>>>     }
>>>     return word, sum
>>> }, grouped)
>>>
>>> // Use a ParDo to format our PCollection of word counts into a printable
>>> // string, suitable for writing to an output file. When each element
>>> // produces exactly one element, the DoFn can simply return it.
>>> formatted := beam.ParDo(s, func(w string, c int) string {
>>>     return fmt.Sprintf("%s: %v", w, c)
>>> }, counted)
>>>
>>>
>>>
>>> I also attached the full source code and output that happens when I run
>>> both wordcount and minimal_wordcount.
>>>
>>> Am I just doing something wrong here? In any case, it seems
>>> inappropriate to panic during runtime without any debugging information
>>> (save a stack trace, but only if you call beamx.Run() as opposed to
>>> direct.Execute(), which just dies without any info).
>>>
>>> Thank you so much,
>>> 8
>>>
>>
>>


Re: Migrating to Gradle: Community Fixit day

2018-03-23 Thread Henning Rohde
I can help with at least the Go and docker aspects and prefer April 3rd as
well.


On Fri, Mar 23, 2018 at 2:08 PM Reuven Lax  wrote:

> Yes, due to time-zone differences between participants, a one-day fixit
> will probably actually last two days :)
>
>
> On Fri, Mar 23, 2018 at 2:03 PM Romain Manni-Bucau 
> wrote:
>
>> Hi Reuven
>>
>> I can try to help on the 3rd (don't forget you are in the future for me,
>> so you may need to launch it on the 2nd for you) but not on the 28th
>> :(.
>>
>>
>> Romain Manni-Bucau
>> @rmannibucau | Blog | Old Blog | Github | LinkedIn | Book
>> 
>>
>> 2018-03-23 22:00 GMT+01:00 Reuven Lax :
>>
>>> Hi,
>>>
>>> Late last November we voted on migrating our build process from Maven to
>>> Gradle. The vote at the time was specifically about incremental migration,
>>> with the statement that as each specific process was migrated to Gradle we
>>> would stop maintaining Maven for that process. The vote concluded with
>>> 16 +1s (4 binding), 1 -0 (binding), and a single -1 (not binding).
>>>
>>> However it's now been over three months, and while a lot of progress has
>>> been made in the migration, we're still stuck maintaining two separate
>>> build systems. I propose that we make a concerted push to finish this
>>> migration, and that it should be a community effort.
>>>
>>> Specifically, I'm proposing a community "FixIt" day focused on migrating
>>> the remaining items. I'll volunteer to help organize and coordinate this
>>> effort. In addition to finishing the migration, this is a great opportunity
>>> to build community cohesiveness.
>>>
>>> If you're interested in participating, please respond and let me know
>>> along with which dates work best for you. I'll start off by suggesting one
>>> of two days: Wednesday March 28 or Tuesday April 3.
>>>
>>> Reuven
>>>
>>
>>


Re: Gradle status

2018-03-22 Thread Henning Rohde
My understanding was the same as Ismaël's. I don't think breaking the build
with large known gaps (but a not fully known cost) is practical. Also, most
items in the JIRA are not even assigned yet.


On Thu, Mar 22, 2018 at 8:03 AM Romain Manni-Bucau 
wrote:

> Not really Ismaël, this thread was about doing it at once and having one
> day to fix it all.
>
> As mentioned at the very beginning, nobody maintains the two systems, so
> after months this must stop: either we drop Maven or Gradle *at once*,
> or we keep a state where each dev does what he wants and the build system
> just doesn't work.
>
> 2018-03-22 15:42 GMT+01:00 Ismaël Mejía :
>
>> I don't think that removing all maven descriptors was the expected
>> path, no? Or even a good idea at this moment.
>>
>> I understood that what we were going to do was to replace
>> incrementally the CI until we cover the whole maven functionality and
>> then remove it, from looking at the JIRA ticket
>> https://issues.apache.org/jira/browse/BEAM-3249 we are still far from
>> covering the complete maven functionality in particular for the
>> release part that could be the biggest pain point.
>>
>>
>> On Thu, Mar 22, 2018 at 9:30 AM, Romain Manni-Bucau
>>  wrote:
>> > hey guys,
>> >
>> > 2.4 is out, do we plan to drop all maven descriptors tomorrow or on
>> monday?
>> >
>> >
>> > Romain Manni-Bucau
>> > @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book
>> >
>> > 2018-03-09 21:42 GMT+01:00 Kenneth Knowles :
>> >>
>> >> On Fri, Mar 9, 2018 at 12:16 PM Lukasz Cwik  wrote:
>> >>>
>> >>> Based upon your description it seems as though you would rather have a
>> >>> way to run existing postcommits without it impacting dashboard/health
>> >>> stats/notifications (we have just run the PostCommits on PRs for
>> >>> additional validation (like upgrading the Dataflow container image)).
>> >>
>> >>
>> >> Yes, that is exactly what I have described.
>> >>
>> >>> I don't think that keeping the current Java PreCommit as a proxy for
>> >>> the Java PostCommit is the right way to go, but I also don't have the
>> >>> time to implement what you're actually asking for.
>> >>
>> >>
>> >> Mostly I thought this might be very easy based on the fact that they
>> are
>> >> nearly identical. If not, oh well.
>> >>
>> >> Kenn
>> >>
>> >>
>> >>> It seems more likely that migrating the PostCommit to Gradle will be
>> less
>> >>> work than adding the functionality, but your argument where the
>> PreCommit is
>> >>> a proxy for the Java PostCommit also applies to the ValidatesRunner
>> >>> PostCommits and so forth requiring even more migration to happen
>> before you
>> >>> don't have to worry about maintaining Maven/breaking post commits.
>> >>>
>> >>> I'm fine with leaving both the Java/Gradle PreCommits running for now
>> and
>> >>> hopefully as more of the PostCommits are migrated off we will be able
>> to
>> >>> remove it.
>> >>>
>> >>> On Fri, Mar 9, 2018 at 11:39 AM, Kenneth Knowles 
>> wrote:
>> 
>>  Separate history (for easy dashboarding, health stats, etc) and
>>  notification (email to dev@ for postcommits, nothing for
>> precommits) for pre
>>  & post commit targets.
>> 
>>  A post commit failure is always a problem to be triaged at high
>>  priority, while a precommit failure is just a natural occurrence.
>> 
>>  On Fri, Mar 9, 2018 at 11:33 AM Lukasz Cwik 
>> wrote:
>> >
>> > Ken, I'm probably not seeing something but how does using the
>> PreCommit
>> > as a proxy improve upon just running the post commit via the phrase
>> it
>> > already supports ('Run Java PostCommit')?
>> >
>> > On Fri, Mar 9, 2018 at 11:22 AM, Kenneth Knowles 
>> > wrote:
>> >>
>> >> Indeed, we've already had the discussion a couple of times and I
>> think
>> >> the criteria are clearly met. Incremental progress is a good thing
>> and we
>> >> shouldn't block it.
>> >>
>> >> OTOH I see where Romain is coming from and I have a good example
>> that
>> >> supports a slightly different action. Consider
>> >> https://github.com/apache/beam/pull/4740 which fixes some errors
>> in how we
>> >> use dependency mechanisms.
>> >>
>> >> This PR is green except that I need to fix some Maven pom slightly
>> >> more. That is throwaway work. I would love to just not have to do
>> it. But
>> >> removing the precommit does not actually make the PR OK to merge.
>> It would
>> >> cause postcommits to fail.
>> >>
>> >> We can hope such situations are rare. I think I tend to be hit by
>> this
>> >> more often than most, as I work with the project build health
>> quite a bit.
>> >>
>> >> Here is a proposal to support these things: instead of deleting the
>> >> job in #4814, move it to not run automatically but only via a
>> phrase. 

Re: Common model for runners

2018-03-20 Thread Henning Rohde
Go currently prints out the model pipeline (as well as the Dataflow
representation) if you use the Dataflow runner. Pass --dry_run=true to not
actually submit a job, but just print out the representations. The graphx
package can also be used to generate a model pipeline manually.
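For example, a typical invocation might look like "go run wordcount.go
--runner=dataflow --dry_run=true" (plus the usual Dataflow flags such as
--project; the exact flag set beyond --dry_run is an assumption here), which
prints the representations without submitting anything.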


On Tue, Mar 20, 2018 at 3:19 PM Robert Bradshaw  wrote:

> The proto representation isn't (yet) part of the public API, and is still
> under active development. However, if you're curious you can see it via
> calling
>
> pipeline.to_runner_api()
>
> in Python or manually invoking classes under
>
>
> https://github.com/apache/beam/tree/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction
>
> for Java. There's probably some equivalent for Go.
>
> (Eventually, the way to get at this would be to implement a Runner whose
> input would be the proto description itself, but currently both Python and
> Java represent Pipelines as native objects when invoking runners.)
>
>
>
> On Tue, Mar 20, 2018 at 2:58 PM Ron Gonzalez  wrote:
>
>> Hi,
>>   When I build a data flow using the Beam SDK, can someone point me to
>> the code that represents the underlying representation of the beam model
>> itself?
>>   Is there an API that lets me retrieve the underlying protobuf-based
>> graph for the data flow? Perhaps some pointers to what code in the runner
>> retrieves this model in order to execute it in the specific engine?
>>
>> Thanks,
>> Ron
>>
>


Re: Using the Go Beam SDK

2018-03-16 Thread Henning Rohde
Hi Philip,

 Thanks for expressing interest in the Go SDK! The documentation is indeed
still incomplete (BEAM-3826) and the main design document is probably
the best starting point right now:

 https://s.apache.org/beam-go-sdk-design-rfc

It also contains links to some of the better documented examples in the
repository.

We do plan to prototype Go streaming w/ PubSub on Dataflow in the near
future (BEAM-3854, BEAM-3856). Is that the setup you're mainly looking at
as well? For Datastore reading, Braden even has a PR in flight. So things
are moving along in that direction and there are definitely lots of
opportunities to contribute. The long-term IO story will likely be based on
some combination of Go SplittableDoFns and cross-language IO.

Please feel free to ask questions, open JIRAs or send PRs. Love to hear
your feedback!

Henning


On Thu, Mar 15, 2018 at 7:21 PM Philip Gianfortoni <8...@tokentransit.com>
wrote:

> Hi dev team,
>
> I am an engineer at Token Transit, a company working on a mobile ticketing
> solution for transit companies. Through Holden (who's working on Py3 w/Beam
> and doing some dev advocacy), we heard about the experimental newly merged
> WIP Go Beam SDK. We are extremely excited about exploring and contributing
> to this new offering from Beam, as almost our entire codebase is written in
> Go, and we are looking for a bulk data processing solution that is
> compatible.
>
> Are there any code pointers, examples, or other docs that someone could
> point me to about this new SDK? Holden pointed us to this repository, but
> it does seem
> that documentation is a bit light at the moment. Any extra examples or docs
> would be greatly appreciated.
>
> Another thing that we were wondering was whether and/or when a PubSub or
> Datastore API will be supported. We are looking at using PubSub to process
> our data incrementally, and in the long term, support for this feature is a
> requirement.
>
> Looking forward to trying out this SDK. Thank you so much!
>
> 8
>


Re: The Go SDK got accidentally merged - options to deal with the pain

2018-03-10 Thread Henning Rohde
Thank you all! I've added the remaining work -- as I understand it -- as
dependencies to the overall Go SDK issue (tracking the "official" merge to
master):

https://issues.apache.org/jira/browse/BEAM-2083

Please feel free to add to this list or expand the items, if there is
anything I overlooked. If the presence of the Go SDK in master causes
issues for other modules, please simply file a bug against me and I'll take
care of it.

Robert - I understand your last reply as addressing Davor's points. Please
let me know if there is anything I need to do in that regard.

Henning



On Fri, Mar 9, 2018 at 8:39 AM, Ismaël Mejía <ieme...@gmail.com> wrote:

> +1 to let it evolve in master (+Davor points), having ongoing work on
> master makes sense given the state of advance + the hope that this
> won't add any issue for the other modules.
>
> On Thu, Mar 8, 2018 at 7:30 PM, Robert Bradshaw <rober...@google.com>
> wrote:
> > +1 to both of these points. SGA should have probably already been filed,
> and
> > excising this from releases should be easy, but I added a line item to
> the
> > validation checklist template to make sure we don't forget.
> >
> > On Thu, Mar 8, 2018 at 7:13 AM Davor Bonaci <da...@apache.org> wrote:
> >>
> >> I support leaving things as they stand now -- thanks for finding a good
> >> way out of an uncomfortable situation.
> >>
> >> That said, two things need to happen:
> >> (1) SGA needs to be filed asap, per Board feedback in the last report,
> and
> >> (2) releases cannot contain any code from the Go SDK before formally
> voted
> >> on the new component and accepted. This includes source releases that
> are
> >> created through "assembly", so manual exclusion in the configuration is
> >> likely needed.
> >>
> >> On Wed, Mar 7, 2018 at 1:54 PM, Kenneth Knowles <k...@google.com> wrote:
> >>>
> >>> Re-reading the old thread, I see these desiderata:
> >>>
> >>>  - "enough IO to write end-to-end examples such as WordCount and
> >>> demonstrate what IOs would look like"
> >>>  - "accounting and tracking the fact that each element has an
> associated
> >>> window and timestamp"
> >>>  - "test suites and test utilities"
> >>>
> >>> Browsing the code, it looks like these each exist to some level of
> >>> completion.
> >>>
> >>> Kenn
> >>>
> >>>
> >>> On Wed, Mar 7, 2018 at 1:38 PM Robert Bradshaw <rober...@google.com>
> >>> wrote:
> >>>>
> >>>> I was actually thinking along the same lines: what was yet lacking to
> >>>> "officially" merge the Go branch in? The thread we started on this
> seems to
> >>>> have fizzled out over the holidays, but windowing support is the only
> >>>> must-have missing technical feature in my book (assuming
> documentation and
> >>>> testing are, or are brought up to snuff).
> >>>>
> >>>>
> >>>> On Wed, Mar 7, 2018 at 1:35 PM Henning Rohde <hero...@google.com>
> wrote:
> >>>>>
> >>>>> One thought: the Go SDK is actually not that far away from satisfying
> >>>>> the guidelines for merging to master anyway (as discussed here [1]).
> If we
> >>>>> decide to simply leave the code in master -- which seems to be what
> this
> >>>>> thread is leaning towards -- I'll gladly sign up to do the remaining
> aspects
> >>>>> (I believe it's only windowing, validation tests and documentation)
> >>>>> reasonably quickly to get to an official vote for accepting it and
> in turn
> >>>>> get master into a sound state. It would seem like the path of least
> hassle.
> >>>>> Of course, I'm happy to go with whatever the community is
> comfortable with
> >>>>> -- just trying to make lemonade out of the merge lemon.
> >>>>>
> >>>>> Henning
> >>>>>
> >>>>> [1]
> >>>>> https://lists.apache.org/thread.html/fd4201980d7a6e67248b1f183ee06b
> 0ff1305bd46f1291495679fc0a@%3Cdev.beam.apache.org%3E
> >>>>>
> >>>>> On Tue, Mar 6, 2018 at 3:40 PM, Kenneth Knowles <k...@google.com>
> wrote:
> >>>>>>
> >>>>>> I think a very easy fix to unblock everyone is
> >>>>>> https://github.com/apache/beam/pull/4809. It just updates one line of a pom.

Re: Gradle status

2018-03-07 Thread Henning Rohde
+1 to a Gradle fixit day/week. I agree with Romain that we should make an
effort to quit this dual state. The Go, Go PreCommit and various container
image Gradle builds, I think, are in a reasonable state (modulo some
documentation updates).

On Wed, Mar 7, 2018 at 1:29 PM, Kenneth Knowles  wrote:

> SGTM. I should also say that every day is Gradle fixit day for me, as I
> have been using only Gradle (with IntelliJ) for a while :-). If anyone is
> hesitant, definitely it is ready to be used for normal dev.
>
> Seems like changing the messaging in onboarding docs is the main thing to
> fixit.
>
> Based on https://builds.apache.org/view/A-D/view/Beam/ and our failure
> spam level the performance tests are mostly not healthy anyhow. So is there
> any high level blocker to switching them or is it just someone sitting down
> with each one?
>
> Kenn
>
>
> On Wed, Mar 7, 2018 at 1:22 PM Lukasz Cwik  wrote:
>
>> Largest outstanding areas are:
>> * Documentation relevant to the contributors guide/release process/testing
>> * Performance tests
>>
>> There has been good progress towards:
>> * Release artifact validations and generation
>> * ValidatesRunner post commits
>> * Pre commits
>> * Container builds
>>
>>
>> On Wed, Mar 7, 2018 at 1:19 PM, Reuven Lax  wrote:
>>
>>> I think Alan was making progress on the Gradle build.
>>>
>>> What do people think of a "fixit" day for Gradle work? (or given that
>>> people are distributed, maybe a fixit week, where everyone takes one day
>>> from the week).
>>>
>>>
>>> On Wed, Mar 7, 2018 at 1:17 PM Kenneth Knowles  wrote:
>>>
 I also cannot drop everything to work on Gradle build, but maybe it
 isn't that drastic anyhow. Now that we have ValidatesRunner and NeedsRunner
 tests and some progress on the release, is there any other known missing
 functionality in the Gradle builds? Archetypes? Docker container images?


 On Wed, Mar 7, 2018 at 1:12 PM Lukasz Cwik  wrote:

> I am working on various projects and may not be able to pause my work
> for a couple of weeks while the build/test process is migrated.
>
> What is everyone thinking about Romain's suggestion, because if I'm the
> only person in such a situation, I would be willing to go along with the
> plan.
>
> On Wed, Mar 7, 2018 at 12:59 PM, Romain Manni-Bucau <
> rmannibu...@gmail.com> wrote:
>
>>
>>
>> Le 7 mars 2018 20:21, "Lukasz Cwik"  a écrit :
>>
>> Note that Alan Myrvold has been making steady progress making the
>> release process via Gradle a reality:
>> 1) Creating a jenkins job which can run the quickstart validation
>> against nightly snapshots and also can be used for release candidates (
>> https://github.com/apache/beam/pull/4252)
>> 2) Building a release candidate using Gradle (
>> https://github.com/apache/beam/pull/4812)
>>
>> Also, Gradle is the tool that has been selected already and there was
>> a community discussion about what was needed for the migration to occur
>> with a clear set of criteria. Romain, it doesn't seem like we should ignore
>> those criteria, or are you suggesting we change them (if yes, how)?
>> that criteria or are you suggesting we change that criteria (if yes, 
>> how)?
>>
>>
>> No, no. My goal is just to quit this state.
>>
>> Let's draft a plan:
>>
>> 1. 2.4 is released - I assume it is done with mvn here
>> 2. We drop all poms and Jenkins mvn config
>> 3. We fix all build issues, if any (let's say in a week)
>> 4. PRs may need updates, but no more mvn merges
>>
>> April is gradle month :)
>>
>> Wdyt?
>>
>>
>>
>>
>> On Wed, Mar 7, 2018 at 10:39 AM, Romain Manni-Bucau <
>> rmannibu...@gmail.com> wrote:
>>
>>>
>>>
>>> Le 7 mars 2018 17:34, "Lukasz Cwik"  a écrit :
>>>
>>> Thanks for bringing this up Romain but I believe your data points on
>>> pass rates are only partially correct.
>>>
>>>
>>> Sure sure, it is mainly about my own PR, which is a very small % of the
>>> whole project ;).
>>>
>>>
>>>
>>> In the past week the Java Gradle precommit passed 46.34% of the time
>>> compared to the Java Maven precommit which passed 46.15% of the time. 
>>> When
>>> I looked at these numbers in mid January they were around 37% so there 
>>> has
>>> been some improvement. Regardless of the build tool it seems that our 
>>> pass
>>> rates aren't stellar for the Java build and are causing the community to
>>> not follow best practices (wait for precommits to be green before 
>>> merging).
>>> I know that on the website we used the mergebot to ensure that things
>>> passed before they were merged; should we institute this on the master
>>> branch, or are there any other ideas?
>>>
>>> As a side note we had achieved 

Re: The Go SDK got accidentally merged - options to deal with the pain

2018-03-07 Thread Henning Rohde
One thought: the Go SDK is actually not that far away from satisfying the
guidelines for merging to master anyway (as discussed here [1]). If we
decide to simply leave the code in master -- which seems to be what this
thread is leaning towards -- I'll gladly sign up to do the remaining
aspects (I believe it's only windowing, validation tests and documentation)
reasonably quickly to get to an official vote for accepting it and in turn
get master into a sound state. It would seem like the path of least hassle.
Of course, I'm happy to go with whatever the community is comfortable with
-- just trying to make lemonade out of the merge lemon.

Henning

[1]
https://lists.apache.org/thread.html/fd4201980d7a6e67248b1f183ee06b0ff1305bd46f1291495679fc0a@%3Cdev.beam.apache.org%3E

On Tue, Mar 6, 2018 at 3:40 PM, Kenneth Knowles  wrote:

> I think a very easy fix to unblock everyone is https://github.com/apache/
> beam/pull/4809. It just updates one line of a pom.
>
>
> On Tue, Mar 6, 2018 at 3:33 PM Robert Bradshaw 
> wrote:
>
>> I'm not sure what value there is in preserving this accidental merge in
>> history, but all options proposed seem fine to me. We should resolve this
>> (or at least unblock other dev work) quickly though.
>>
>>
>> On Tue, Mar 6, 2018 at 3:16 PM Kenneth Knowles  wrote:
>>
>>> My own vote is for leaving the history immutable, which is the case for
>>> the full rollback or leaving it there disabled.
>>>
>>>
>>> On Tue, Mar 6, 2018 at 3:01 PM Thomas Weise  wrote:
>>>
 +1 for (1), assuming it is straightforward to exclude from the build
 and eventually will end up in master anyways.

 On Tue, Mar 6, 2018 at 2:59 PM, Robert Bradshaw 
 wrote:

> I would opt for (2), but I'm not sure who has permissions to do that.
> It should be easy to re-merge the couple of things that have gone in since
> then.
>
>
> On Tue, Mar 6, 2018 at 2:43 PM Kenneth Knowles  wrote:
>
>> Hi all,
>>
>> You may have noticed that our tests are red. A pull request that was
>> meant for the Go SDK branch accidentally got merged onto the master 
>> branch.
>> Things have been merged to master since then.
>>
>> I've opened a revert at https://github.com/apache/beam/pull/4808
>>
>> The next time there is a master to go-sdk merge it will need to be
>> re-reverted.
>>
>> Two other options are (1) leave it there and disable it in whatever
>> way and (2) rebase dropping the commit and force push master (breaks all
>> checkouts that are past it).
>>
>> Kenn
>>
>



Re: The Go SDK got accidentally merged - options to deal with the pain

2018-03-06 Thread Henning Rohde
I'm happy to deal with any needed fixup on the Go SDK side in either case.

On Tue, Mar 6, 2018 at 3:20 PM, Lukasz Cwik  wrote:

> Can we open up the pair of commits so that master gets reverted and the Go
> SDK merges from master plus another rollback?
>
> On Tue, Mar 6, 2018 at 2:42 PM, Kenneth Knowles  wrote:
>
>> Hi all,
>>
>> You may have noticed that our tests are red. A pull request that was
>> meant for the Go SDK branch accidentally got merged onto the master branch.
>> Things have been merged to master since then.
>>
>> I've opened a revert at https://github.com/apache/beam/pull/4808
>>
>> The next time there is a master to go-sdk merge it will need to be
>> re-reverted.
>>
>> Two other options are (1) leave it there and disable it in whatever way
>> and (2) rebase dropping the commit and force push master (breaks all
>> checkouts that are past it).
>>
>> Kenn
>>
>
>


Re: beam_PreCommit_Go_MavenInstall appears perma-red

2018-02-12 Thread Henning Rohde
Ah. It looks like an issue with pulling dependencies. There is a followup
PR that might address it: https://github.com/apache/beam/pull/4633. Sorry
for the noise.

On Mon, Feb 12, 2018 at 9:48 AM, Kenneth Knowles <k...@google.com> wrote:

> Oh, yes. Copy/paste fail. I mean the gradle one.
>
> On Mon, Feb 12, 2018 at 9:46 AM, Henning Rohde <hero...@google.com> wrote:
>
>> Let me take a look. I was not aware there was a regression with the Maven
>> Go precommit (or do you mean the Gradle one?).
>>
>> Henning
>>
>> On Mon, Feb 12, 2018 at 8:59 AM, Kenneth Knowles <k...@google.com> wrote:
>>
>>> Does anyone have context on the status here? I know it is a new build.
>>> Perhaps it can be comment-triggered only until it is reliable enough that
>>> we would be willing to block merges on it?
>>>
>>> Kenn
>>>
>>
>>
>


Re: beam_PreCommit_Go_MavenInstall appears perma-red

2018-02-12 Thread Henning Rohde
Let me take a look. I was not aware there was a regression with the Maven
Go precommit (or do you mean the Gradle one?).

Henning

On Mon, Feb 12, 2018 at 8:59 AM, Kenneth Knowles  wrote:

> Does anyone have context on the status here? I know it is a new build.
> Perhaps it can be comment-triggered only until it is reliable enough that
> we would be willing to block merges on it?
>
> Kenn
>


Re: A 15x speed-up in local Python DirectRunner execution

2018-02-08 Thread Henning Rohde
Awesome! Well done, Charles.
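
For anyone who wants to try it ahead of the switch, opting in by hand looks
roughly like this (a minimal sketch -- the runner path is the one Charles
gives below, while the pipeline contents are just an illustration):

    import apache_beam as beam
    from apache_beam.runners.portability.fn_api_runner import FnApiRunner

    # Run a tiny wordcount on the FnApiRunner instead of the default
    # DirectRunner.
    p = beam.Pipeline(runner=FnApiRunner())
    (p
     | 'Create' >> beam.Create(['to be or not to be'])
     | 'Split' >> beam.FlatMap(lambda line: line.split())
     | 'Count' >> beam.combiners.Count.PerElement())
    p.run().wait_until_finish()

Passing --runner=apache_beam.runners.portability.fn_api_runner.FnApiRunner
on the command line should work as well.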

On Thu, Feb 8, 2018 at 9:10 AM, Ismaël Mejía  wrote:

> Sounds impressive, and with the extra portability stuff, great !
> Worth the switch just for he user experience improvement.
>
> On Thu, Feb 8, 2018 at 5:52 PM, Robert Bradshaw 
> wrote:
> > This is going to be a great improvement for our users! I'll take a
> > look at the pull request.
> >
> > On Wed, Feb 7, 2018 at 7:03 PM, Kenneth Knowles  wrote:
> >> Nice!
> >>
> >> On Wed, Feb 7, 2018 at 6:45 PM, Charles Chen  wrote:
> >>>
> >>> The existing DirectRunner will be needed for the foreseeable future
> since
> >>> it is currently the only local runner that supports streaming
> execution.
> >>>
> >>>
> >>> On Wed, Feb 7, 2018, 6:39 PM Pablo Estrada  wrote:
> 
>  Very cool Charles! Have you considered whether you'll want to remove
> the
>  direct runner code afterwards?
>  Best
>  -P.
> 
> 
>  On Wed, Feb 7, 2018, 6:25 PM Lukasz Cwik  wrote:
> >
> > That is pretty awesome.
> >
> > On Wed, Feb 7, 2018 at 5:55 PM, Charles Chen  wrote:
> >>
> >> Local execution of Beam pipelines on the Python DirectRunner
> currently
> >> suffers from performance issues, which makes it hard for pipeline
> authors to
> >> iterate, especially on medium to large size datasets.  We would
> like to
> >> optimize and make this a better experience for Beam users.
> >>
> >>
> >> The FnApiRunner was written as a way of leveraging the portability
> >> framework execution code path for local portability development.
> We've found
> >> it also provides great speedups in batch execution with no user
> changes
> >> required, so we propose to switch to use this runner by default in
> batch
> >> pipelines.  For example, WordCount on the Shakespeare dataset with
> a single
> >> CPU core now takes 50 seconds to run, compared to 12 minutes
> before; this is
> >> a 15x performance improvement that users can get for free, with no
> user
> >> pipeline changes.
> >>
> >>
> >> The JIRA for this change is here
> >> (https://issues.apache.org/jira/browse/BEAM-3644), and a candidate
> patch is
> >> available here (https://github.com/apache/beam/pull/4634). I have
> been
> >> working over the last month on making this an automatic drop-in
> replacement
> >> for the current DirectRunner when applicable.  Before it becomes the
> >> default, you can try this runner now by manually specifying
> >> apache_beam.runners.portability.fn_api_runner.FnApiRunner as the
> runner.
> >>
> >>
> >> Even with this change, local Python pipeline execution can only
> >> effectively use one core because of the Python GIL.  A natural next
> step to
> >> further improve performance will be to refactor the FnApiRunner to
> allow for
> >> multi-process execution.  This is being tracked here
> >> (https://issues.apache.org/jira/browse/BEAM-3645).
> >>
> >>
> >> Best,
> >>
> >> Charles
> >
> >
> 
> 
>  --
>  Got feedback? go/pabloem-feedback
> >>
> >>
>


Re: Samza Runner

2018-01-25 Thread Henning Rohde
+1 Exciting to see a new runner!

On Thu, Jan 25, 2018 at 8:56 PM, Jesse Anderson 
wrote:

> Excellent!
>
> On Fri, Jan 26, 2018, 5:37 AM Kenneth Knowles  wrote:
>
>> Hi all,
>>
>> In case you haven't noticed or followed, there's a new runner in PR:
>> Samza!
>>
>> https://github.com/apache/beam/pull/4340
>>
>> It has been under review and revision for some time. In local mode it
>> passes a solid suite of ValidatesRunner tests (I don't have a Samza
>> deployment handy to test non-local).
>>
>> Given all this, I am ready to put it on a feature branch where it can
>> mature further, and we can build out our CI for it, etc, until we agree it
>> is ready for master.
>>
>> Kenn
>>
>


Re: Dataflow runner examples build fail

2018-01-08 Thread Henning Rohde
+1

On Mon, Jan 8, 2018 at 1:32 AM, Ted Yu  wrote:

> +1
>
>  Original message 
> From: Jean-Baptiste Onofré 
> Date: 1/8/18 1:26 AM (GMT-08:00)
> To: dev@beam.apache.org
> Subject: Dataflow runner examples build fail
>
> Hi guys,
>
> The PRs and nightly builds are failing due to an issue with the Dataflow
> platform: it seems we have exceeded the disk quota in the us-central1 region.
>
> I would like to do a clean out and increase the quota a bit.
>
> Thoughts ?
>
> Thanks
> Regards
> JB
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [DISCUSS] Guidelines for merging new runners and SDKs into master

2017-12-20 Thread Henning Rohde
Thanks for the comments!


> > (2) What is the minimal set of IO connectors a new SDK must support?
> Given
> > the upcoming cross-language feature in the portability framework, can we
> > rely on that to meet the requirement
> > without implementing any native IO connectors?
>
> It could be argued that there needs to be enough IO to write
> end-to-end examples such as WordCount and demonstrate what IOs would
> look like. TextIO may satisfy this. Once we have cross-language
> universal local runners, we could eschew even that (and perhaps the
> requirement would be simply that it runs against that runner).
>

Yes -- TextIO is something that naturally is added with a direct runner,
but with cross-language IO it may be a toy version for that purpose alone.
Real use might use a more robust version from another SDK with more
supported filesystems, for example. The Go SDK is de facto sort of leaning
towards that approach.


> > (3) Similarly to new runners, new SDKs should handle at least a useful
> > subset of the model, but not necessarily the whole model (at the time of
> > merge). A global-window-batch-only SDK targeting the portability
> framework,
> > for example, could be as useful a contribution in master as a full model
> SDK
> > that is supported by a direct runner only. Of course, this is not to say
> > that SDKs should not strive to support the full model, but rather -- like
> > Python streaming -- that it's fine to pursue that goal in master beyond a
> > certain point. That said, I'm curious as to why this guideline for SDKs
> was
> > set that specifically originally.
>
> While I don't think full model completeness is a feasible goal for
> merging into master (e.g. metadata driven triggers and retractions
> aren't even fully fleshed out yet), there are certain core pieces of
> the model that must be present, the notion of windowing among them. In
> addition, event-time driven windowing is one of the distinguishing
> features of Beam so it seems a regression to have SDKs without it, and
> may affect design choices like how windows and timestamps are assigned
> or inspected. Also, from a pragmatic point of view, accounting and
> tracking the fact that each element has an associated window and
> timestamp that must flow through the pipeline and taken into account
> during grouping is not something that is easily bolted on later to a
> global-window-batch-only SDK, and should be built in from the start,
> not offered as a vague promise someone will get to
> post-merge-to-master.
>
> I'd be OK supporting only a subset of WindowFns, but more than
> GlobalWindows (equivalent to "we'll just ignore the window
> everywhere") only.
>
> FWIW, streaming vs. batch is not part of the "model" per se, it's an
> operational difference. The full model was present in the SDK before
> any streaming backends were ready.
>

These are good points. I am more thinking about it from a viewpoint of what
makes a useful contribution. For runners, for example, the guidelines allow
for a narrower focus -- traditional batch is called out -- and I think that
makes sense for SDKs as well. In both cases, one focus or another might
make it harder to support the full model later, but that seems beyond the
scope of a general guideline. The added isolation of the portability
framework hopefully also makes it less likely that any particular design
choice becomes too expensive to change later.


> > Finally, while portability support for various features -- such as side
> > input, cross-language I/O and the reference runner -- is still underway,
> > what should the guidelines be? For the Go SDK specifically, if in
> master, it
> > would bring the additional utility of helping test the portability
> framework
> > as it's being developed. On the other hand, it can't support features
> that
> > do not yet exist.
>
> Fortunately I think we have a little bit of time to get the full
> portability story into place before the Go SDK is ready to be merged.
> (On the note of helping development, I don't see anything that the Go
> SDK could offer specifically that the Python SDK can't.)
>

Indeed :). There is nothing specific the Go SDK would offer for helping
development other than better exercising the framework.


> In short, I think the list at
> https://beam.apache.org/contribute/feature-branches/ stands, with the
> additional requirement of Fn API support, and on that note (3) may be
> the (FnApi speaking) Reference Runner against which the IOs for (2)
> could be more easily satisfied.
>
> One more point of consideration, we should probably have at least one
> committer committed to (and in a position to) support it.
>

Makes sense, although I hadn't really thought about it for Go until now. What
would you suggest for new committer-less SDKs/runners, where a fair chunk
of the code would almost by construction be unfamiliar to committers? A
cool part of Beam portability IMO is that it opens the door for
non-mainstream languages to participate.


Re: Pushing daily/test containers for python

2017-12-20 Thread Henning Rohde
+1

It would be great to be able to test this aspect of portability. For
testing purposes, I think whatever container registry is convenient to use
for distribution is fine.

Regarding frequency, I think we should consider something closer to (a).
The container images -- although usually quite stable -- are part of the
SDK at that commit and are not guaranteed to work with any other version.
Breaking changes in their interaction would cause confusion and create
noise. Any local tests can also in theory just build the container images
directly and not use any registry, so it might make sense to set up the
tests so that pushing occurs less frequently than building.

Henning



On Wed, Dec 20, 2017 at 3:10 PM, Ahmet Altay  wrote:

> Hi all,
>
> After some recent changes (e.g. [1]) we have a feasible container that we
> can use to test Python SDK on portability framework. Until now we were
> using Google provided container images for testing and for the released
> product. We can gradually move away from that (at least partially) for
> Python SDK.
>
> I would like to propose building containers for testing purposes only and
> pushing them to gcr.io as part of Jenkins jobs. I would like to clarify
> two points with the team first:
>
> 1. Use of GCR, I am proposing it for a few reasons:
> - Beam's jenkins workers run on GCP, and it would be easy to push them to
> gcr from there.
> - If we use another service (perhaps with a free tier for open source
> projects) we might be overusing it by pushing/pulling from our daily tests.
> - This is similar to how we stage some artifacts to GCS as part of the
> testing process.
>
> 2. Frequency of building and pushing containers
>
> a. We can run it at every PR, by integrating with python post commit tests.
> b. We can run it daily, by having a new Jenkins job.
> c. We can run it manually, by having a parameterized Jenkins job that can
> build and push a new container from a tag/commit. Given that we
> infrequently change container code, I would suggest choosing this option.
>
> What do you think about this? To be clear, this is just a proposal about
> the testing environment. I am not suggesting anything about the release
> artifacts.
>
> Thank you,
> Ahmet
>
> [1] https://github.com/apache/beam/pull/4286
>


[DISCUSS] Guidelines for merging new runners and SDKs into master

2017-12-19 Thread Henning Rohde
Hi everyone,

 As part of the Go SDK development, I was looking at the guidelines for
merging new runners and SDKs into master [1] and I think they would benefit
from being updated to reflect the emerging portability framework. Specific
suggestions:

(1) Both runners and SDKs should support the portability framework (to the
extent the model is supported by the runner/SDK). It would be
counter-productive at this time for the ecosystem to go against that effort
without a compelling reason. Direct runners not included.

(2) What is the minimal set of IO connectors a new SDK must support? Given
the upcoming cross-language feature in the portability framework, can we
rely on that to meet the requirement
without implementing any native IO connectors?

(3) Similarly to new runners, new SDKs should handle at least a useful
subset of the model, but not necessarily the whole model (at the time of
merge). A global-window-batch-only SDK targeting the portability framework,
for example, could be as useful a contribution in master as a full model
SDK that is supported by a direct runner only. Of course, this is not to
say that SDKs should not strive to support the full model, but rather --
like Python streaming -- that it's fine to pursue that goal in master
beyond a certain point. That said, I'm curious as to why this guideline for
SDKs was set that specifically originally.

Finally, while portability support for various features -- such as side
input, cross-language I/O and the reference runner -- is still underway,
what should the guidelines be? For the Go SDK specifically, if in master,
it would bring the additional utility of helping test the portability
framework as it's being developed. On the other hand, it can't support
features that do not yet exist.

What do you all think?

Thanks,
 Henning

[1] https://beam.apache.org/contribute/feature-branches/


Re: Vendor files in Go SDK

2017-12-13 Thread Henning Rohde
Thanks JB. Glad it seems like a simple thing to fix.

On Wed, Dec 13, 2017 at 9:14 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Hi,
>
> No problem Henning !
>
> That's actually the result of gradle build, you are right.
>
> And as I ran maven build just after, rat complained about the files
> without license in go.
>
> It's the files located in sdks/go/vendor folder.
>
> I checked the .gitignore, I see vendor exclude.
>
> Let me check if git clean -x does the trick.
>
> Sorry for the noise.
>
> Regards
> JB
>
> On 12/13/2017 05:32 PM, Henning Rohde wrote:
>
>> The Go SDK doesn't ship or distribute vendor files. However, the gradle
>> build creates them during the build process. I believe I excluded these
>> from various config when I added gradle support for Go, but evidently I
>> forgot something. Sorry about the trouble. Can you clarify which files you
>> mean?
>>
>> Thanks,
>>   Henning
>>
>> On Tue, Dec 12, 2017 at 11:44 PM, Jean-Baptiste Onofré <j...@nanthrax.net
>> <mailto:j...@nanthrax.net>> wrote:
>>
>> Hi guys,
>>
>> I just noticed that the Go SDK ships vendor files.
>>
>> Many of these files don't have license header.
>>
>> I would like to double check, assuming we distribute these files,
>> that they
>> are Apache license compatible (not a Cat X):
>>
>> https://www.apache.org/legal/resolved.html
>> <https://www.apache.org/legal/resolved.html>
>>
>> Can a Go SDK contributor update me about that ?
>>
>> On the other hand, the rat plugin configuration should be updated to
>> exclude
>> those vendor files. I'm preparing a PR about that.
>>
>> Thanks,
>> Regards
>> JB
>> -- Jean-Baptiste Onofré
>> jbono...@apache.org <mailto:jbono...@apache.org>
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: Vendor files in Go SDK

2017-12-13 Thread Henning Rohde
The Go SDK doesn't ship or distribute vendor files. However, the gradle
build creates them during the build process. I believe I excluded these
from various config when I added gradle support for Go, but evidently I
forgot something. Sorry about the trouble. Can you clarify which files you
mean?

Thanks,
 Henning

On Tue, Dec 12, 2017 at 11:44 PM, Jean-Baptiste Onofré 
wrote:

> Hi guys,
>
> I just noticed that the Go SDK ships vendor files.
>
> Many of these files don't have license header.
>
> I would like to double check, assuming we distribute these files, that
> they are Apache license compatible (not a Cat X):
>
> https://www.apache.org/legal/resolved.html
>
> Can a Go SDK contributor update me about that ?
>
> On the other hand, the rat plugin configuration should be updated to
> exclude those vendor files. I'm preparing a PR about that.
>
> Thanks,
> Regards
> JB
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [PROPOSAL] Beam Go SDK feature branch

2017-12-04 Thread Henning Rohde
Thanks everyone. Appreciate the feedback!

Henning

On Sat, Dec 2, 2017 at 2:03 PM, Ankur Chauhan <an...@malloc64.com> wrote:

> Hi
>
> This is amazing. Having used the Java SDK for over two years now and
> recently transitioned to writing Go for a bunch of microservices, I really
> like the simplicity. I used Gleam to do data processing and loved it. That
> said, I would much prefer the Beam SDK paradigms.
>
>
> Ankur
>
>
> On Fri, Dec 1, 2017 at 14:39 Kenneth Knowles <k...@google.com> wrote:
>
>> This is awesome!
>>
>> With three SDKs the Beam portability vision will have a lot to offer.
>>
>> Kenn
>>
>> On Fri, Dec 1, 2017 at 10:49 AM, Robert Bradshaw <rober...@google.com>
>> wrote:
>>
>>> Very Exciting!
>>>
>>> The Go language is at a very different point of the design space than
>>> either Java or Python; it's interesting to see how you've explored
>>> making this fit with the Beam model. Thanks for the detailed design
>>> doc.
>>>
>>> +1 to targeting the portability framework directly. Once all runners
>>> are upgraded to use this too it'll just work everywhere.
>>>
>>> On Thu, Nov 30, 2017 at 8:51 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
>>> wrote:
>>> > Hi Henning,
>>> >
>>> > Thanks for the update, that's great !
>>> >
>>> > Regards
>>> > JB
>>> >
>>> > On 12/01/2017 12:40 AM, Henning Rohde wrote:
>>> >>
>>> >> Hi everyone,
>>> >>
>>> >>   We have been prototyping a Go SDK for Beam for some time and have
>>> >> reached a point, where we think this effort might be of interest to
>>> the
>>> >> wider Beam community and would benefit from being developed in a
>>> proper
>>> >> feature branch. We have prepared a PR to that end:
>>> >>
>>> >> https://github.com/apache/beam/pull/4200
>>> >>
>>> >> Please note that the prototype supports batch only, for now, and
>>> includes
>>> >> various examples that run on the Go direct runner. Go would be the
>>> first SDK
>>> >> targeting the portability framework exclusively and our plan is to
>>> extend
>>> >> and benefit from that ecosystem.
>>> >>
>>> >> We have also prepared an RFC document with the initial design,
>>> motivations
>>> >> and tradeoffs made:
>>> >>
>>> >> https://s.apache.org/beam-go-sdk-design-rfc
>>> >> <https://s.apache.org/beam-go-sdk-design-rfc>
>>> >>
>>> >> The challenge is that Go is quite a tricky language for Beam due to
>>> >> various limitations, notably strong typing w/o generics, and so the
>>> >> approaches taken by Java and Python do not readily apply.
>>> >>
>>> >> Of course, neither the prototype nor the design are in any way final
>>> --
>>> >> there are many open questions and we absolutely welcome ideas and
>>> >> contributions. Please let us know if you have any comments or
>>> objections (or
>>> >> would like to help!).
>>> >>
>>> >> Thanks,
>>> >>   Henning Rohde, Bill Neubauer, and Robert Burke
>>> >
>>> >
>>> > --
>>> > Jean-Baptiste Onofré
>>> > jbono...@apache.org
>>> > http://blog.nanthrax.net
>>> > Talend - http://www.talend.com
>>>
>>
>> --
> Sent from lazarus
>


[PROPOSAL] Beam Go SDK feature branch

2017-11-30 Thread Henning Rohde
Hi everyone,

 We have been prototyping a Go SDK for Beam for some time and have reached
a point, where we think this effort might be of interest to the wider Beam
community and would benefit from being developed in a proper feature
branch. We have prepared a PR to that end:

 https://github.com/apache/beam/pull/4200

Please note that the prototype supports batch only, for now, and includes
various examples that run on the Go direct runner. Go would be the first
SDK targeting the portability framework exclusively and our plan is to
extend and benefit from that ecosystem.

We have also prepared an RFC document with the initial design, motivations
and tradeoffs made:

  https://s.apache.org/beam-go-sdk-design-rfc

The challenge is that Go is quite a tricky language for Beam due to various
limitations, notably strong typing w/o generics, and so the approaches
taken by Java and Python do not readily apply.

Of course, neither the prototype nor the design are in any way final --
there are many open questions and we absolutely welcome ideas and
contributions. Please let us know if you have any comments or objections
(or would like to help!).

Thanks,
 Henning Rohde, Bill Neubauer, and Robert Burke


Re: Questions with containerized runners plans?

2017-11-18 Thread Henning Rohde
A benefit of using docker containers is that (nearly) arbitrary native
dependencies can be installed in the container image itself by either the
user or SDK. For example, the (minimal, in progress) Python container
Dockerfile is here:


https://github.com/apache/beam/blob/1039f5b9682fa6aa5fba256110c63caf4d0da41f/sdks/python/container/Dockerfile

Any user could simply augment it with "pip install" commands, say, or use
something else entirely (although the corresponding boot program may also
need to change in that case). The Python SDK itself might also include
options/scripts/etc to make common customizations easier to apply and to
avoid installing them at runtime. Multiple Dockerfiles can also co-exist.
How the container image is actually passed to the runner is a choice made
by each SDK, which is why it's not discussed much in the portability
context. But a uniform flag along the lines of --sdk_harness_container_image
to include the image in the pipeline proto would seem desirable. That said,
I don't think it has been explored much yet in any SDK how all these
capabilities would best be exposed to users.
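
As a sketch only -- the flag below is the suggestion above, not an option any
SDK implements today -- passing the image from Python might end up looking
something like:

    from apache_beam.options.pipeline_options import PipelineOptions

    # Hypothetical flag name, taken from the proposal above. PipelineOptions
    # tolerates unknown flags, but nothing consumes this one yet.
    options = PipelineOptions(
        ['--sdk_harness_container_image=gcr.io/my-project/py-harness:latest'])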

Finally, there have been several thoughts on cross-language pipelines and I
think it's a very exciting aspect of the portability framework. A doc is
here:

   https://s.apache.org/beam-mixed-language-pipelines.

It is also linked from the design section on the portability page.

Thanks,
 Henning


On Sat, Nov 18, 2017 at 6:33 AM, Holden Karau  wrote:

> So I was looking through https://beam.apache.org/contribute/portability/
> which lead me to BEAM-2900, and then to
> https://docs.google.com/document/d/1n6s3BOxOPct3uF4UgbbI9O9rpdiKW
> FH9R6mtVmR7xp0/edit#
> .
>
> I was wondering if there is any considerations being given to native
> dependencies that user code may have (especially things like Python
> packages which can be super painful to deal with in a Spark cluster unless
> you use one of the vendor solutions)?
>
> Also, and this may be a terrible idea, but has there been thought given to
> the idea of cross-language pipelines (I see these in Spark occasionally
> but with the DL stuff happening I suspect we might see users wanting
> cross-language functionality more often)?
>
> I also saw "Proposal: introduce an option to pass SDK harness container
> image in Beam SDKs" & it seems like Robert brought up the benefits of using
> Docker for Python runners, but I don't see the details on how we would
> expose that to users it in the design docs I've found yet (which could very
> well be I'm not looking at the right docs).
>
> Cheers,
>
> Holden :)
>
> --
> Twitter: https://twitter.com/holdenkarau
>


Re: Portability overview webpage

2017-11-09 Thread Henning Rohde
Thanks Holden! Do you mean whether alternatives to gRPC/protobuf are being
discussed? If so, I'm not aware of any alternative proposals.

On Wed, Nov 8, 2017 at 9:30 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

> Awesome! Out of interest is there any discussion around common formats for
> interchange going on?
>
> On Tue, Nov 7, 2017 at 9:15 AM, Henning Rohde <hero...@google.com.invalid>
> wrote:
>
> > Thanks everyone! The page is now live at:
> >
> >https://beam.apache.org/contribute/portability/
> >
> > Henning
> >
> > On Thu, Nov 2, 2017 at 8:22 AM, Kenneth Knowles <k...@google.com.invalid>
> > wrote:
> >
> > > This is a superb high-level overview of the effort, understandable at a
> > > glance. I think it is the first time someone has made it clear what we
> > are
> > > actually doing!
> > >
> > > Kenn
> > >
> > > On Wed, Nov 1, 2017 at 10:23 AM, Jean-Baptiste Onofré <j...@nanthrax.net
> >
> > > wrote:
> > >
> > > > Thanks for the update. I will take a look.
> > > >
> > > > Regards
> > > > JB
> > > >
> > > > On Nov 1, 2017, 18:21, at 18:21, Henning Rohde
> > > <hero...@google.com.INVALID>
> > > > wrote:
> > > > >Hi everyone,
> > > > >
> > > > >Although portability is a large and involved effort, it seems it
> > > > >doesn't
> > > > >have a high-level overview and plan written down anywhere. I added a
> > > > >proposed page with a 10,000 ft view and links to the website under
> > > > >'Contribute (technical references)'. There is a page for ongoing
> > > > >projects,
> > > > >but portability is much more encompassing and seems to be more
> > > > >suited for its own page.
> > > > >
> > > > >The PR is:
> > > > >
> > > > > https://github.com/apache/beam-site/pull/340
> > > > >
> > > > >I'm sending it out to the dev list for more visibility. Please let
> me
> > > > >know
> > > > >if you have any comments or objections, or if there is a better
> place
> > > > >for
> > > > >this content.
> > > > >
> > > > >Thanks,
> > > > > Henning
> > > >
> > >
> >
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>


Portability overview webpage

2017-11-01 Thread Henning Rohde
Hi everyone,

Although portability is a large and involved effort, it seems it doesn't
have a high-level overview and plan written down anywhere. I added a
proposed page with a 10,000 ft view and links to the website under
'Contribute (technical references)'. There is a page for ongoing projects,
but portability is much more encompassing and seems to be more suited for
its own page.

The PR is:

 https://github.com/apache/beam-site/pull/340

I'm sending it out to the dev list for more visibility. Please let me know
if you have any comments or objections, or if there is a better place for
this content.

Thanks,
 Henning


Re: [VOTE] Switch to new JIRA workflow for pending proposals

2017-11-01 Thread Henning Rohde
+1. Also +1 to Kenn's suggestion of adding a "triage" stage prior to
"open" for other bugs and features.

On Wed, Nov 1, 2017 at 9:19 AM, Lukasz Cwik 
wrote:

> +1
>
> On Wed, Nov 1, 2017 at 5:52 AM, Kenneth Knowles 
> wrote:
>
> > +1 I like it.
> >
> > This is the JIRA alternative to FLIP/KIP/HIP etc, yes? I definitely favor
> > having automation around that rather than just wiki or web page entries.
> >
> > Any thoughts on how it interacts with other types of tasks?
> >
> > Also, an analogous initial state seems useful for all bugs & features.
> > Namely, a "needs triage" state prior to "open".
> >
> > On Tue, Oct 31, 2017 at 10:23 PM, Jean-Baptiste Onofré 
> > wrote:
> >
> > > +1
> > >
> > > It sounds like a good workflow to track the state of ideas/proposals.
> > >
> > > Regards
> > > JB
> > >
> > > On Nov 1, 2017, 05:43, at 05:43, Reuven Lax 
> > > wrote:
> > > >There is a new JIRA workflow for tracking pending proposals. This
> > > >workflow
> > > >is described in https://issues.apache.org/jira/browse/INFRA-12698.
> > > >Please
> > > >vote on whether you believe Beam should adopt this new workflow.
> > > >
> > > >Thanks,
> > > >
> > > >Reuven
> > >
> >
>


Re: [DISCUSS] Move away from Apache Maven as build tool

2017-10-30 Thread Henning Rohde
+1 to the initiative. It would be great to have better support for Go and
Docker container images. The current Go maven integration in particular is
clunky [1], but I'll have to look into the details of the alternatives to
see if they are better.

Henning

[1] https://github.com/apache/beam/blob/master/sdks/go/BUILD.md

On Mon, Oct 30, 2017 at 10:27 AM, Kenneth Knowles 
wrote:

> I also support exploring a move away from Apache Maven for orchestrating
> our build.
>
> For a single-module project, I still think it can be a good build tool, and
> we could still use it for this sort of thing, but I think we are reaching a
> multi-module scale where it does not work well. Almost all of our jobs
> build things that are not needed and run tests that are redundant, and it
> is not easy to do better, even with a sequence of maven commands.
>
> I'd like to lay out what we hope for from a tool. Here's a start:
>
> General:
>
>  - Dependency-driven build so devs working on one thing build & test only
> what is needed
>  - Supports orchestration across Protobuf, Java, Python, Go, Docker images
>  - Meets devs where they are, letting folks in one language use familiar
> tools
>  - Caching across builds as much as possible
>  - Easily extensible for when it doesn't have the feature we need (writing
> a maven plugin is too much, using maven-exec-plugin is too crufty)
>  - Preferably a declarative configuration language
>
> Java needs beyond the basics, which could be executed by the orchestrator
> or by module-local mvn builds, etc.
>
>  - Pulling deps from maven central and alternate repos
>  - Findbugs
>  - RAT
>  - Dependency rule enforcement
>  - IWYU (Include What You Use)
>  - Multiple Java versions in same project
>  - ASF release workflow
>
> I probably missed some must-haves or nice-to-haves. I'd love to compile
> thoughts on other languages' needs.
>
> Based on these, another project I would consider is Bazel. We could very
> easily use it to orchestrate, but use Maven (or Gradle!) on the leaves. I
> think that Gradle is also more focused on the JVM ecosystem, so it is not
> quite as neutral as Bazel, and it uses Groovy, which is a bit more esoteric
> than the Python used for writing Bazel rules.
>
> Kenn
>
> On Mon, Oct 30, 2017 at 9:37 AM, Lukasz Cwik 
> wrote:
>
> > I wanted to make this thread more visible. This discussion stems from
> > Kenn's thread about Jenkins pre/post commit issues [1].
> >
> > I did some investigation into ways to improve the quality of the signal
> > from Jenkins by trying to modify the Jenkins jobs spawned from Groovy. I
> > had limited success, but everything I was doing felt like just patching
> > symptoms of the problem, which is that our build is just too slow.
> > For example, we keep adding all these profiles to Maven or tuning how a
> > plugin runs to eke out a small decrease in build time. I believe swapping
> > away from Apache Maven to a build tool which only builds the things which
> > have changed in a PR would be the best approach.
> >
> > I would suggest that we migrate to Gradle as our build tool. I am
> > suggesting Gradle because:
> > * It is used in lots of open source projects and has a very large
> community
> > behind it.
> > * It has better support for building languages other than Java
> > (PyGradle[2], GoGradle[3], ...)
> > * Its incremental build support works and only builds things that changed
> > through the use of a build cache. Even without the build cache (or for
> > clean builds), it is much faster.
> > * Apache Infra already has Gradle v4.x installed on the Jenkins machines.
> >
> > Any alternatives that should be considered or should we stick with Apache
> > Maven?
> >
> > 1: https://lists.apache.org/thread.html/25311e0e95be5c49afb168d9b4b4d357984c10c39c7b01da8ff3baaf@%3Cdev.beam.apache.org%3E
> > 2: https://github.com/linkedin/pygradle
> > 3: https://github.com/gogradle/gogradle
> >
>


Re: Portability JIRA organization

2017-09-11 Thread Henning Rohde
Thanks Kenn and Aljoscha! I've added a bunch of issues and reorganized some
of the existing ones as per the proposal. Of course, anyone should feel free to
make changes if something is off or doesn't make sense.

I would like to call out a few of the new issues:

   * BEAM-2899: "Universal Local Runner".
   * BEAM-2896: "Portability milestone: wordcount runs everywhere".
   * BEAM-2941: "Portability milestone: windowed wordcount runs everywhere".
   * BEAM-2940: "Portability milestone: mobile gaming runs everywhere".

The first is a local reference implementation for the portability API. The
latter three are intended to track the bigger milestones that all SDKs and
runners implement enough of the portability framework to run X.

Thanks,
 Henning



On Fri, Sep 8, 2017 at 4:42 PM, Aljoscha Krettek <aljos...@apache.org>
wrote:

> +1
>
> Sounds great!
>
> > On 9. Sep 2017, at 00:44, Kenneth Knowles <k...@google.com.INVALID>
> wrote:
> >
> > +1 to this.
> >
> > It is really easy to lose track of things in a sea of tickets, and
> > portability touches every SDK and runner, so getting this organized will
> be
> > hugely helpful. Especially getting some clear higher-level milestones (by
> > moving tickets to subtasks) we can work towards and announce when done
> will
> > be great.
> >
> > Kenn
> >
> > On Fri, Sep 8, 2017 at 1:59 PM, Henning Rohde <hero...@google.com.invalid
> >
> > wrote:
> >
> >> Hi everyone,
> >>
> >> The portability effort involves a large amount of work cutting across
> all
> >> SDKs and runners [e.g., 1,2,3,4]. It seems to be only partially
> captured in
> >> JIRAs, so I'd like to volunteer to try to flesh it out further. In
> addition
> >> to adding JIRAs for missing work, I was thinking of organizing it as
> >> follows:
> >>
> >>  (1) Designs, proto definitions and the like belong in the "beam-model"
> >> component. For example, BEAM-2862 tracks support for User State over
> the Fn
> >> API.
> >>  (2) Dependent work items are subtasks that belong in the component where the work
> is
> >> needed. For example, making the Python SDK harness use the User State
> over
> >> the Fn API would be in the "sdk-py" component and be a subtask (or
> >> otherwise linked to) the overall BEAM-2862 issue. Similarly for each
> runner
> >> and other SDKs.
> >>  (3) All portability issues should use a "portability" label, regardless
> >> of component, to identify the overall effort.
> >>
> >> The aim is to make it easier to see what works (and remains to be
> done)
> >> from both the point of view of each SDK and runner as well as
> feature-wise
> >> for portability.
> >>
> >> Please let me know if you have any comments or objections.
> >>
> >> Thanks,
> >> Henning
> >>
> >> [1] https://s.apache.org/beam-fn-api
> >> [2] https://s.apache.org/beam-runner-api
> >> [3] https://s.apache.org/beam-job-api
> >> [4] https://s.apache.org/beam-fn-api-container-contract
> >>
>
>


Portability JIRA organization

2017-09-08 Thread Henning Rohde
Hi everyone,

 The portability effort involves a large amount of work cutting across all
SDKs and runners [e.g., 1,2,3,4]. It seems to be only partially captured in
JIRAs, so I'd like to volunteer to try to flesh it out further. In addition
to adding JIRAs for missing work, I was thinking of organizing it as
follows:

  (1) Designs, proto definitions and the like belong in the "beam-model"
component. For example, BEAM-2862 tracks support for User State over the Fn
API.
  (2) Dependent work items are subtasks that belong in the component where the work is
needed. For example, making the Python SDK harness use the User State over
the Fn API would be in the "sdk-py" component and be a subtask (or
otherwise linked to) the overall BEAM-2862 issue. Similarly for each runner
and other SDKs.
  (3) All portability issues should use a "portability" label, regardless
of component, to identify the overall effort.

The aim is to make it easier to see what works (and what remains to be done)
from both the point of view of each SDK and runner as well as feature-wise
for portability.

Please let me know if you have any comments or objections.

Thanks,
 Henning

[1] https://s.apache.org/beam-fn-api
[2] https://s.apache.org/beam-runner-api
[3] https://s.apache.org/beam-job-api
[4] https://s.apache.org/beam-fn-api-container-contract