Jenkins build is back to normal : beam_SeedJob #1110

2018-02-16 Thread Apache Jenkins Server
See

Re: [PROPOSAL] Add a blog post for every new release

2018-02-16 Thread Eugene Kirpichov
We could also create a GitHub label for PRs that should be looked at when crafting the next release notes, applied at each committer's discretion. On Fri, Feb 16, 2018, 2:36 PM Robert Bradshaw wrote: > Huge +1 to proper release notes, which may make sense as blog posts as > well

Re: [VOTE] Release 2.3.0, release candidate #3

2018-02-16 Thread Jean-Baptiste Onofré
Hi, Can someone from Python grant me permission to upload Python SDK 2.3.0 to PyPI? My user is jbonofre. Thanks! Regards JB On 02/16/2018 03:40 AM, Jean-Baptiste Onofré wrote: > Great !!! > > Thanks for the update, I will close the vote then. > > Regards > JB > > On 02/15/2018 11:45 PM,

@TearDown guarantees

2018-02-16 Thread Romain Manni-Bucau
Hi guys, I'm a bit concerned about this PR https://github.com/apache/beam/pull/4637 I understand the intent, but I'd like to share how I see it and why it is an issue for me: 1. You can't do anything if the JVM crashes, in any case. Tomcat, for instance, tried preallocating some memory in order to free it in case

Re: @TearDown guarantees

2018-02-16 Thread Jean-Baptiste Onofré
Hi Romain, isn't @FinishBundle your solution? Regards JB On Feb 16, 2018, at 17:06, Romain Manni-Bucau wrote: >I see Reuven, so it is actually a broken contract for end users more >than a >bug. Concretely a user must have a way to execute code once the

multi-env var representation for pipeline options

2018-02-16 Thread Romain Manni-Bucau
Hi guys, this is a follow-up to the thread on creating pipeline options from system properties. The context of this new thread is bringing portability into the picture, and therefore whether we want a common and shared way to create pipeline options from the environment, since it impacts

Re: [PROPOSAL] Add a blog post for every new release

2018-02-16 Thread Ismaël Mejía
As discussed in this thread, I created an initial version of a document for the release notes. Feel free to add details that you consider worthwhile (and to correct my English mistakes) or new sections/ideas. Remember the idea is to have a concise document of the most important changes in this

Re: PipelineOptions fromSystemProps?

2018-02-16 Thread Kenneth Knowles
> On Thu, Feb 15, 2018, 11:47 PM Romain Manni-Bucau > wrote: > >> >> 2018-02-15 20:00 GMT+01:00 Kenneth Knowles : >> >>> On Thu, Feb 15, 2018 at 12:03 AM, Romain Manni-Bucau < >>> rmannibu...@gmail.com> wrote: 2. default properties = env + system

Re: @TearDown guarantees

2018-02-16 Thread Romain Manni-Bucau
I see Reuven, so it is actually a broken contract for end users more than a bug. Concretely, a user must have a way to execute code once the teardown is no longer used and a teardown is populated by the user in the context of an execution. It means that if the environment wants to pool (cache) the

Re: PipelineOptions fromSystemProps?

2018-02-16 Thread Romain Manni-Bucau
Oh, so the point was that env vars would fall under the portability umbrella whereas system properties would not? That kind of makes sense to me phrased this way. Do we want another thread for that? Romain Manni-Bucau @rmannibucau | Blog

Re: @TearDown guarantees

2018-02-16 Thread Reuven Lax
On Fri, Feb 16, 2018 at 8:06 AM, Romain Manni-Bucau wrote: > I see Reuven, so it is actually a broken contract for end users more than > a bug. Concretely a user must have a way to execute code once the teardown > is no more used and a teardown is populated by the user in

Re: PipelineOptions fromSystemProps?

2018-02-16 Thread Kenneth Knowles
Good idea to split the discussion. We can complete PipelineOptions.fromSystemProperties immediately for Java and separate the cross-language "multi-env var representation" for pipeline options. Probably Python & Go fans have muted this thread, after all. On Fri, Feb 16, 2018 at 8:07 AM, Romain

Re: @TearDown guarantees

2018-02-16 Thread Kenneth Knowles
What I am hearing is this: - @FinishBundle does what you want (a reliable "flush" call) but your runner is not doing a good job of bundling - @Teardown has well-defined semantics and they are not what you want So you are hoping for something that is called less frequently but is still

Re: @TearDown guarantees

2018-02-16 Thread Reuven Lax
So the concern is that @TearDown might not be called? Let's understand the reason for @TearDown. The runner is free to cache the DoFn object across many invocations, and indeed in streaming this is often a critical optimization. However if the runner does decide to destroy the DoFn object (e.g.
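The lifecycle described in this message can be illustrated with a toy simulation. This is a minimal sketch in plain Python, not the Beam API: `RecordingFn` and `run_bundles` are hypothetical names, and the method names only mirror the hooks under discussion (@Setup, @StartBundle, @FinishBundle, @Teardown). It assumes a runner that keeps a single cached instance for all bundles and then discards it; real runners choose bundle boundaries and instance lifetimes themselves.

```python
class RecordingFn:
    """Stand-in for a DoFn (hypothetical; not a Beam class)."""

    def __init__(self):
        self.events = []   # lifecycle calls, in order
        self.buffer = []   # elements seen in the current bundle
        self.flushed = []  # elements flushed so far

    def setup(self):
        self.events.append("setup")

    def start_bundle(self):
        self.events.append("start_bundle")

    def process(self, element):
        self.buffer.append(element)

    def finish_bundle(self):
        # Flush point: runs at the end of every bundle.
        self.flushed.extend(self.buffer)
        self.buffer.clear()
        self.events.append("finish_bundle")

    def teardown(self):
        self.events.append("teardown")


def run_bundles(fn, bundles):
    """Toy runner: one cached instance reused across all bundles."""
    fn.setup()
    for bundle in bundles:
        fn.start_bundle()
        for element in bundle:
            fn.process(element)
        fn.finish_bundle()
    # The toy runner discards the instance here, so teardown runs once.
    fn.teardown()
    return fn


fn = run_bundles(RecordingFn(), [[1, 2], [3], [4, 5, 6]])
```

In this sketch the flush happens once per bundle, while setup and teardown each happen once per instance, which is the distinction the thread keeps returning to.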

Re: @TearDown guarantees

2018-02-16 Thread Kenneth Knowles
It sounds like you just want @FinishBundle On Fri, Feb 16, 2018 at 8:06 AM, Romain Manni-Bucau wrote: > I see Reuven, so it is actually a broken contract for end users more than > a bug. Concretely a user must have a way to execute code once the teardown > is no more used

Re: @TearDown guarantees

2018-02-16 Thread Reuven Lax
+1 I think @FinishBundle is the right thing to look at here. On Fri, Feb 16, 2018, 8:41 AM Jean-Baptiste Onofré wrote: > Hi Romain > > Is it not @FinishBundle your solution ? > > Regards > JB > On Feb 16, 2018, at 17:06, Romain Manni-Bucau > wrote:

Re: PipelineOptions fromSystemProps?

2018-02-16 Thread Romain Manni-Bucau
will create it now, thanks Romain Manni-Bucau @rmannibucau | Blog | Old Blog | Github | LinkedIn | Book

Re: @TearDown guarantees

2018-02-16 Thread Romain Manni-Bucau
Finish bundle is well defined and must be called, right, but not at the end, so you still miss teardown as a user. Bundles are defined by the runner and you can have 10 bundles per batch (even more for a stream ;)) so you don't want to release your resources or handle your execution auditing in it,

Re: @TearDown guarantees

2018-02-16 Thread Reuven Lax
@TearDown refers to DoFn teardown not process teardown (it's basically a destructor). So it's also runner defined. There may be a place for a container that lives as long as the process (not tied to the DoFn life). However that would be something new to add. On Fri, Feb 16, 2018, 8:52 AM Romain

Re: @TearDown guarantees

2018-02-16 Thread Romain Manni-Bucau
2018-02-16 17:59 GMT+01:00 Kenneth Knowles : > What I am hearing is this: > > - @FinishBundle does what you want (a reliable "flush" call) but your > runner is not doing a good job of bundling > Nope, @FinishBundle is defined, but a bundle is not. Typically for 1 million rows I'll

Re: PipelineOptions fromSystemProps?

2018-02-16 Thread Reuven Lax
I don't have a strong feeling about null here (though we're Java 8 now - can't we start using Optional instead?). However it's always easier to add things and harder to remove things. If it's not really needed now, I'd vote to leave it out of this PR and add later if necessary. On Thu, Feb 15,

Re: A 15x speed-up in local Python DirectRunner execution

2018-02-16 Thread Robert Bradshaw
Yes, it does work for Java pipelines, modulo https://github.com/apache/beam/pull/4211 . I'm actually not sure what the performance characteristics are, but I'm sure the improvement (if any) is not as dramatic as what we see in Python. It's great for development though. On Fri, Feb 16,

Re: A 15x speed-up in local Python DirectRunner execution

2018-02-16 Thread Marián Dvorský
Does the same runner work for Java pipelines? (I assume so, given that it uses portability framework.) If so, does it provide similar speedup? On Fri, Feb 16, 2018 at 7:37 PM Robert Bradshaw wrote: > If there are no concerns, I say let's merge this. > > On Fri, Feb 16, 2018

Re: multi-env var representation for pipeline options

2018-02-16 Thread Kenneth Knowles
I'd like to focus on getting feedback on the part of your proposal where - we treat a collection of env vars under a shared prefix as textual pipeline options - every SDK has compatible parsing of the option names and values Beyond that, most use cases should use their own namespace prefix
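To make the proposal above concrete, here is a minimal sketch of reading pipeline options from env vars under a shared prefix. Everything specific in it is an assumption for illustration: the `BEAM_OPT_` prefix, the lower-casing of names, and the `--name=value` output form are hypothetical and are not taken from the thread or from any SDK.

```python
def options_from_env(environ, prefix="BEAM_OPT_"):
    """Turn env vars under a shared prefix into textual pipeline options.

    The prefix and the name mapping are hypothetical, chosen for this sketch.
    """
    args = []
    for key in sorted(environ):  # sorted for a deterministic result
        if key.startswith(prefix) and len(key) > len(prefix):
            name = key[len(prefix):].lower()
            args.append(f"--{name}={environ[key]}")
    return args


# Illustrative environment; variables outside the prefix are ignored.
example_env = {
    "BEAM_OPT_RUNNER": "DirectRunner",
    "BEAM_OPT_JOB_NAME": "demo",
    "PATH": "/usr/bin",
}
args = options_from_env(example_env)
```

Passing `os.environ` instead of `example_env` would read the real process environment. Each SDK would need the same prefix and name rules for the parsing to be compatible across languages, which is the second point of the proposal.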

Re: @TearDown guarantees

2018-02-16 Thread Thomas Groh
I'll note as well that you don't need a well defined DoFn lifecycle method - you just want less granular bundling, which is a different requirement. Teardown has well-defined interactions with the rest of the DoFn methods, and what the runner is permitted to do when it calls Teardown - the fact

Re: @TearDown guarantees

2018-02-16 Thread Romain Manni-Bucau
So do I understand correctly that a leak of the Dataflow implementation impacts the API? It also sounds like this perf issue is due to blind serialization instead of modeling what is serialized - nothing should be slow enough in the serialization at that level; do you have more details on that particular point?

Re: @TearDown guarantees

2018-02-16 Thread Thomas Groh
On perf: Deserialization of an arbitrary object is expensive. This cost is amortized over all of the elements that the object processes, but for a runner with small bundles, that cost never gets meaningfully amortized - deserializing a DoFn instance of unknown complexity to process one element
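The amortization argument can be written as simple arithmetic. The sketch below assumes one deserialization per bundle and uses made-up cost figures in arbitrary time units; `cost_per_element` is a hypothetical helper, not a measurement of any runner.

```python
def cost_per_element(deserialize_cost, process_cost, bundle_size):
    """Average cost per element when the DoFn is deserialized once per bundle."""
    if bundle_size < 1:
        raise ValueError("bundle_size must be at least 1")
    return deserialize_cost / bundle_size + process_cost


# Hypothetical numbers: deserialization costs 10 units, processing 0.5 units.
single_element_bundle = cost_per_element(10.0, 0.5, 1)
large_bundle = cost_per_element(10.0, 0.5, 1000)
```

With one-element bundles the fixed deserialization cost dominates every element; with large bundles it nearly vanishes, which is why reusing the instance across bundles matters most for runners with small bundles.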

Re: [VOTE] Release 2.3.0, release candidate #3

2018-02-16 Thread Ben Sidhom
To get Flink 1.4.0 installed on Dataproc, I used my own init action. After starting a detached Flink session (e.g., HADOOP_CONF_DIR=/etc/hadoop/conf /usr/lib/flink/bin/yarn-session.sh -n 4 -tm 4000 -d), I pulled the Flink Job Manager address and port from the Application Master web UI. After that,

Re: @TearDown guarantees

2018-02-16 Thread Kenneth Knowles
On Fri, Feb 16, 2018 at 9:39 AM, Romain Manni-Bucau wrote: > > 2018-02-16 18:18 GMT+01:00 Kenneth Knowles : > >> Which runner's bundling are you concerned with? It sounds like the Flink >> runner? >> > > Flink, Spark, DirectRunner, DataFlow at least (others

[RESULT][VOTE] Release 2.3.0, release candidate #3

2018-02-16 Thread Jean-Baptiste Onofré
Hi all, I'm happy to announce that we have unanimously approved this release. There are 9 approving votes, 6 of which are binding: * Jean-Baptiste Onofré * Ismaël Mejia * Lukasz Cwik * Reuven Lax * Ahmet Altay * Robert Bradshaw There are no disapproving votes. Thanks everyone! Regards JB On

Re: @TearDown guarantees

2018-02-16 Thread Thomas Groh
Given that I'm the original author of both the @Setup and @Teardown methods and the PR under discussion, I thought I'd drop in to give a bit of history and my thoughts on the issue. Originally (Dataflow 1.x), the spec required a Runner to deserialize a new instance of a DoFn for every Bundle.

Re: @TearDown guarantees

2018-02-16 Thread Kenneth Knowles
Which runner's bundling are you concerned with? It sounds like the Flink runner? Kenn On Fri, Feb 16, 2018 at 9:04 AM, Romain Manni-Bucau wrote: > > 2018-02-16 17:59 GMT+01:00 Kenneth Knowles : > >> What I am hearing is this: >> >> - @FinishBundle does

Re: A 15x speed-up in local Python DirectRunner execution

2018-02-16 Thread Charles Chen
I hope those interested have had time to test this out. I have sent out https://github.com/apache/beam/pull/4696 to switch to using this fast runner as the default DirectRunner for local execution. Let me know if there are any concerns. On Tue, Feb 13, 2018 at 12:17 PM Charles Chen

Re: A 15x speed-up in local Python DirectRunner execution

2018-02-16 Thread Robert Bradshaw
If there are no concerns, I say let's merge this. On Fri, Feb 16, 2018 at 9:39 AM, Charles Chen wrote: > I hope those interested have had time to test this out. I have sent out > https://github.com/apache/beam/pull/4696 to switch to using this fast runner > as the default

Re: @TearDown guarantees

2018-02-16 Thread Reuven Lax
Kenn is correct. Allowing Fn reuse across bundles was a major, major performance improvement. Profiling on the old Dataflow SDKs consistently showed Java serialization being the number one performance bottleneck for streaming pipelines, and Beam fixed this. Romain - can you state precisely what

Re: @TearDown guarantees

2018-02-16 Thread Romain Manni-Bucau
On Feb 16, 2018 19:28, "Kenneth Knowles" wrote: On Fri, Feb 16, 2018 at 9:39 AM, Romain Manni-Bucau wrote: > > 2018-02-16 18:18 GMT+01:00 Kenneth Knowles : > >> Which runner's bundling are you concerned with? It sounds like the Flink

Re: @TearDown guarantees

2018-02-16 Thread Kenneth Knowles
On Fri, Feb 16, 2018 at 1:00 PM, Romain Manni-Bucau wrote: > > The serialization of fn being once per bundle, the perf impact is only > huge if there is a bug somewhere else, even java serialization is > negligeable on big config compared to any small pipeline (seconds vs >

Re: [VOTE] Release 2.3.0, release candidate #3

2018-02-16 Thread Eugene Kirpichov
Thanks Ben! On Fri, Feb 16, 2018 at 10:06 AM Ben Sidhom wrote: > To get Flink 1.4.0 installed on Dataproc, I used my own init action. After > starting a detached flink session (e.g, HADOOP_CONF_DIR=/etc/hadoop/conf > /usr/lib/flink/bin/yarn-session.sh -n 4 -tm 4000 -d), I

Re: [PROPOSAL] Add a blog post for every new release

2018-02-16 Thread Eugene Kirpichov
Thanks Ismael - I've added a couple of the major things I know. I tried scanning the whole git shortlog for major things, but it was too much, so probably better if individual committers make their contributions to the document. On Fri, Feb 16, 2018 at 9:01 AM Ismaël Mejía

Re: [PROPOSAL] Add a blog post for every new release

2018-02-16 Thread Robert Bradshaw
Huge +1 to proper release notes, which may make sense as blog posts as well as the email announcement. Yes, we have the Jira "release notes," but they are only one step above the git commit logs, and generally hard to digest (especially for an outsider). For example, if I go to

Build failed in Jenkins: beam_SeedJob #1109

2018-02-16 Thread Apache Jenkins Server
See -- GitHub pull request #4677 of commit 12285eb726db506dd8505621a4a51bf2ec12bf14, no merge conflicts. Setting status of 12285eb726db506dd8505621a4a51bf2ec12bf14 to PENDING with url