Re: [VOTE] Policies for managing Beam dependencies

2018-06-06 Thread Chamikara Jayalath
Hi Kenn,

On Wed, Jun 6, 2018 at 8:14 PM Kenneth Knowles  wrote:

> +0.5
>
> I like the spirit of these policies. I think they need a little wording
> work. Comments inline.
>
> On Wed, Jun 6, 2018 at 4:53 PM, Chamikara Jayalath 
>> wrote:
>>>
>>>
>>> (1) Human readable reports on status of Beam dependencies are generated
>>> weekly and shared with the Beam community through the dev list.
>>>
>>
> Who is responsible for generating these? The mechanism or responsibility
> should be made clear.
>
> I clicked through a doc -> thread -> doc to find even some details. It
> looks like manual run of a gradle command was adopted. So the
> responsibility needs an owner, even if it is "unspecified volunteer on dev@
> and feel free to complain or do it yourself if you don't see it"
>

This is described in the following doc (referenced by my doc).
https://docs.google.com/document/d/1rqr_8a9NYZCgeiXpTIwWLCL7X8amPAVfRXsO72BpBwA/edit#

The proposal is to run an automated Jenkins job weekly, so no one needs to
generate these reports manually.


>
> (2) Beam components should define dependencies and their versions at the
>>> top level.
>>>
>>
> I think the big "should" works better with some guidance about when
> something might be an exception, or at least explicit mention that there
> can be rare exceptions. Unless you think that is never the case. If there
> are no exceptions, then say "must" and if we hit a roadblock we can revisit
> the policy.
>

The idea was to allow exceptions. Added more details to the doc.


>
>
> (3) A significantly outdated dependency (identified manually or through
>>> tooling) should result in a JIRA that is a blocker for the next release.
>>> Release manager may choose to push the blocker to the subsequent release or
>>> downgrade from a blocker.
>>>
>>
> How is "significantly outdated" defined? By dev@ discussion? Seems like
> the right way. Anyhow that's what will happen in practice as people debate
> the blocker bug.
>

This will be determined either by the automated Jenkins job (see the doc
above; the proposal is to flag new major versions, and new minor versions
that are more than six months old) or manually, for any critical updates
that the Jenkins job will not capture (more details in the doc). Manually
identified critical dependency updates may involve a discussion on the dev
list.


>
>
> (4) Dependency declarations may identify owners that are responsible for
>>> upgrading the respective dependencies.
>>>
>>> (5) Dependencies of Java SDK components that may cause issues to other
>>> components if leaked should be shaded.
>>>
>>
> We previously agreed upon our intent to migrate to "pre-shaded" aka
> "vendored" packages:
> https://lists.apache.org/thread.html/12383d2e5d70026427df43294e30d6524334e16f03d86c9a5860792f@%3Cdev.beam.apache.org%3E
>
> With Maven, this involved a lot of boilerplate so I never did it. With
> Gradle, we can easily build a re-usable rule to create such a package in a
> couple of lines. I just opened the first WIP PR here:
> https://github.com/apache/beam/pull/5570 it is blocked by deleting the
> poms anyhow so by then we should have a configuration that works to vendor
> our currently shaded artifacts.
>
> So I think this should be rephrased to "should be vendored" so we don't
> have to revise the policy.
>

Thanks for the pointer. I agree that vendoring is a good approach.

Here are the updated policies (with more details added to the doc). I agree
with Ahmet's point that these votes should be converted to web site pages
where we can give more details and examples.

(1.a) Human-readable reports on the status of Beam dependencies are
generated weekly by an automated Jenkins job and shared with the Beam
community through the dev list.

(2.a) Beam components should define dependencies and their versions at the
top level. There can be rare exceptions, but they should come with
explanations.

(3.a) A significantly outdated dependency (identified manually or through
the automated Jenkins job) should result in a JIRA that is a blocker for
the next release. The release manager may choose to push the blocker to the
subsequent release or downgrade it from a blocker.

(4.a) Dependency declarations may identify owners that are responsible for
upgrading the respective dependencies.

(5.a) Dependencies of Java SDK components that may cause issues to other
components if leaked should be vendored.
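
To illustrate what vendoring means in practice for Java code: Beam would
compile such a dependency under a relocated package, so it can never clash
with the user's own copy of the same library. The package name below is
hypothetical, purely for illustration, and not an actual Beam artifact:

// Hypothetical relocated import for a vendored gRPC copy -- illustration only:
import org.apache.beam.vendor.grpc.v1.io.grpc.ManagedChannel;
// ...instead of the original package, which a user's pipeline may also
// bring in at a different version:
// import io.grpc.ManagedChannel;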


Thanks,
Cham


>
> Kenn
>
>
>
>> Please vote:
>>> [ ] +1, Approve that we adopt these policies
>>> [ ] -1, Do not approve (please provide specific comments)
>>>
>>> Thanks,
>>> Cham
>>>
>>> [1]
>>> https://lists.apache.org/thread.html/8738c13ad7e576bc2fef158d2cc6f809e1c238ab8d5164c78484bf54@%3Cdev.beam.apache.org%3E
>>> [2]
>>> https://docs.google.com/document/d/15m1MziZ5TNd9rh_XN0YYBJfYkt0Oj-Ou9g0KFDPL2aA/edit?usp=sharing
>>>
>>
>>


Re: Beam breaks when it isn't loaded via the Thread Context Class Loader

2018-06-06 Thread Romain Manni-Bucau
Not sure the example is atomic enough, but in
https://github.com/Talend/component-runtime/blob/master/component-runtime-manager/src/main/java/org/talend/sdk/component/runtime/manager/finder/StandaloneContainerFinder.java#L40
the "instance()" is a singleton used by the whole runtime of the framework.

Deserialization happens in
https://github.com/Talend/component-runtime/blob/master/component-runtime-impl/src/main/java/org/talend/sdk/component/runtime/serialization/SerializableService.java#L26
and serialization is about creating this object in a writeReplace. The
runtime then switches its classloader (the runner's, for Beam) as in
https://github.com/Talend/component-runtime/blob/master/component-runtime-impl/src/main/java/org/talend/sdk/component/runtime/base/LifecycleImpl.java#L60
as soon as possible and resets it once done, so as not to break its
environment in the reused-JVM case.

In the case of an IO, the IO would lazily create its defined classloader
from its spec and use some reference-counting logic to destroy it when
needed, in its teardown for instance. The IO then does the classloader
switch in its callbacks (setup/teardown/process/bundle hooks, etc.), as in
the sketch below.
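
A rough Java sketch of that per-IO pattern, with lazy creation and
reference counting so teardown can close the classloader (all names here
are hypothetical, not Talend or Beam APIs):

import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical sketch: a classloader shared by instances of one IO,
// created lazily and closed when the last user releases it in teardown.
final class IoClassLoaderHolder {
  private static URLClassLoader loader;
  private static int refCount;

  static synchronized ClassLoader acquire(URL[] jars) {
    if (loader == null) {
      loader = new URLClassLoader(jars, IoClassLoaderHolder.class.getClassLoader());
    }
    refCount++;
    return loader;
  }

  static synchronized void release() throws IOException {
    if (--refCount == 0 && loader != null) {
      loader.close(); // destroy once no instance uses it anymore
      loader = null;
    }
  }
}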


On Wed, Jun 6, 2018 at 23:33, Lukasz Cwik  wrote:

> Romain, can you point to an example of a global singleton registry that
> does this right for class loading (it may allow people to work towards such
> an effort)?
>
> On Tue, Jun 5, 2018 at 10:06 PM Romain Manni-Bucau 
> wrote:
>
>> It is actually very localised in runner code where beam should reset the
>> classloader when the deserialization happens and then the runner owns the
>> classloader all the way in evaluators.
>>
>> If IOs change the classloader they must indeed handle it too, and patch
>> the deserialization as well.
>>
>> Here again (we mentioned it multiple times in other threads) Beam misses
>> a global singleton registry where you can register a "service" to look it
>> up based on a serialization configuration, with a lifecycle allowing the
>> classloader to be closed in all instances without hacks.
>>
>>
>> On Tue, Jun 5, 2018 at 23:50, Kenneth Knowles  wrote:
>>
>>> Perhaps we can also adopt a practice of making our own APIs explicitly
>>> pass a Classloader when appropriate so we only have to set this when we are
>>> entering code that does not have good hygiene. It might actually be nice to
>>> have a lightweight static analysis to forbid bad methods in our code.
>>>
>>> Kenn
>>>
>>> On Mon, Jun 4, 2018 at 3:43 PM Lukasz Cwik  wrote:
>>>
 I totally agree, but there are so many Java APIs (including ours) that
 messed this up so everyone lives with the same hack.

 On Mon, Jun 4, 2018 at 3:41 PM Andrew Pilloud 
 wrote:

> It seems like a terribly fragile way to pass arguments but my tests
> pass when I wrap the JDBC path into Beam pipeline execution with that
> pattern.
>
> Thanks!
>
> Andrew
>
> On Mon, Jun 4, 2018 at 3:20 PM Lukasz Cwik  wrote:
>
>> It is a common mistake for APIs to not include a way to specify which
>> class loader to use when doing something like deserializing an instance of
>> a class via the ObjectInputStream. This common issue also affects Apache
>> Beam (SerializableCoder, PipelineOptionsFactory, ...) and the way that
>> typical Java APIs have gotten around this is to use the thread context
>> class loader (TCCL) as the way to plumb this additional attribute through.
>> So Apache Beam is meant to honor the TCCL in all places if it has been set,
>> as most (though not all) Java libraries do the same hack.
>>
>> In most environments the TCCL is not set and we are working with a
>> single class loader. It turns out that in more complicated environments
>> (like when loading a JDBC driver, or JNDI, or an application server, ...)
>> this usually doesn't work without each caller knowing what class loading
>> context they should be in. A common workaround for most scenarios is to
>> always set the TCCL to the current class's class loader, like so, before
>> invoking any APIs that do class loading, so you don't propagate the
>> caller's TCCL along, since they may have set it for some other reason:
>>
>> ClassLoader originalClassLoader = Thread.currentThread().getContextClassLoader();
>> try {
>>   Thread.currentThread().setContextClassLoader(getClass().getClassLoader());
>>   // call some API that uses reflection without taking a ClassLoader param
>> } finally {
>>   Thread.currentThread().setContextClassLoader(originalClassLoader);
>> }
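
As an aside, the pattern above can be wrapped once as a try-with-resources
guard; a small sketch (a hypothetical helper, not an existing Beam utility):

// Hypothetical helper: restores the original TCCL even if the API call throws.
final class TcclGuard implements AutoCloseable {
  private final ClassLoader original = Thread.currentThread().getContextClassLoader();

  TcclGuard(ClassLoader replacement) {
    Thread.currentThread().setContextClassLoader(replacement);
  }

  @Override
  public void close() {
    Thread.currentThread().setContextClassLoader(original);
  }
}

// Usage:
// try (TcclGuard guard = new TcclGuard(getClass().getClassLoader())) {
//   // call some API that uses reflection without taking a ClassLoader param
// }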
>>
>>
>>
>> On Mon, Jun 4, 2018 at 1:57 PM Andrew Pilloud 
>> wrote:
>>
>>> I'm having class loading issues that go away when I revert the
>>> changes in our use of Class.forName added in
>>> https://github.com/apache/beam/pull/4674. The problem I'm having is
>>> that the typical JDBC GUI (SqlWorkbench/J, SQuirreL SQL) creates an
>>> isolated class loader to load our library.

Re: Beam SQL Pipeline Options

2018-06-06 Thread arun kumar
Hi

Thanks for the update.
Can you please share me if you have any documentation for connecting to
postgres using beam code.

Thanks
Arun
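
For reference, Beam reads from Postgres through JdbcIO, which works with
any JDBC driver. A minimal read sketch (the connection string, credentials,
and query are placeholders; it needs beam-sdks-java-io-jdbc and the
Postgres driver on the classpath):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Minimal sketch: read one column from a Postgres table into a PCollection.
public class PostgresReadExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(
        JdbcIO.<String>read()
            .withDataSourceConfiguration(
                JdbcIO.DataSourceConfiguration.create(
                        "org.postgresql.Driver", "jdbc:postgresql://localhost:5432/mydb")
                    .withUsername("user")
                    .withPassword("password"))
            .withQuery("SELECT name FROM users")
            .withCoder(StringUtf8Coder.of())
            .withRowMapper(resultSet -> resultSet.getString(1)));
    p.run().waitUntilFinish();
  }
}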

On Wed, Jun 6, 2018, 9:54 PM Andrew Pilloud  wrote:

> We are just about to the point of having a working pure SQL workflow for
> Beam! One of the last things that remains is how to configure Pipeline
> Options via a SQL shell. I have written up a proposal to use the set
> statement, for example "SET runner=DataflowRunner". I'm looking for
> feedback, particularly on what will make for the best user experience.
> Please take a look and comment:
>
>
> https://docs.google.com/document/d/1UTsSBuruJRfGnVOS9eXbQI6NauCD4WnSAPgA_Y0zjdk/edit?usp=sharing
>
> Andrew
>


Re: Beam SQL Pipeline Options

2018-06-06 Thread Kenneth Knowles
This is a nice short design discussion doc, and there is perhaps a cooler
piece of news hidden in the paragraph :-)

Kenn

On Wed, Jun 6, 2018 at 9:24 AM Andrew Pilloud  wrote:

> We are just about to the point of having a working pure SQL workflow for
> Beam! One of the last things that remains is how to configure Pipeline
> Options via a SQL shell. I have written up a proposal to use the set
> statement, for example "SET runner=DataflowRunner". I'm looking for
> feedback, particularly on what will make for the best user experience.
> Please take a look and comment:
>
>
> https://docs.google.com/document/d/1UTsSBuruJRfGnVOS9eXbQI6NauCD4WnSAPgA_Y0zjdk/edit?usp=sharing
>
> Andrew
>
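
As a sketch of the proposed semantics (hypothetical; not the actual
implementation in the doc), SET statements could simply accumulate
key/value pairs that are handed to PipelineOptionsFactory when a query is
submitted:

import java.util.HashMap;
import java.util.Map;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Hypothetical sketch: turn "SET key=value" shell statements into
// PipelineOptions via the existing --key=value argument parsing.
class SetStatementOptions {
  private final Map<String, String> values = new HashMap<>();

  void handleSet(String statement) {
    // e.g. "SET runner=DataflowRunner" (assumes well-formed input)
    String[] kv = statement.substring("SET ".length()).split("=", 2);
    values.put(kv[0].trim(), kv[1].trim());
  }

  PipelineOptions toOptions() {
    return PipelineOptionsFactory.fromArgs(
            values.entrySet().stream()
                .map(e -> "--" + e.getKey() + "=" + e.getValue())
                .toArray(String[]::new))
        .create();
  }
}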


Re: Announcement & Proposal: HDFS tests on large cluster.

2018-06-06 Thread Kenneth Knowles
This is rad. Another +1 from me for a bigger cluster. What do you need to
make that happen?

Kenn

On Wed, Jun 6, 2018 at 10:16 AM Pablo Estrada  wrote:

> This is really cool!
>
> +1 for having a cluster with more than one machine run the test.
>
> -P.
>
> On Wed, Jun 6, 2018 at 9:57 AM Chamikara Jayalath 
> wrote:
>
>> On Wed, Jun 6, 2018 at 5:19 AM Łukasz Gajowy 
>> wrote:
>>
>>> Hi all,
>>>
>>> I'd like to announce that, thanks to Kamil Szewczyk, since this PR we
>>> have 4 file-based HDFS tests running on a "Large HDFS Cluster"! More
>>> specifically I mean:
>>>
>>> - beam_PerformanceTests_TextIOIT_HDFS
>>> - beam_PerformanceTests_Compressed_TextIOIT_HDFS
>>> - beam_PerformanceTests_AvroIOIT_HDFS
>>> - beam_PerformanceTests_XmlIOIT_HDFS
>>>
>>> The "Large HDFS Cluster" (in contrast to the small one, that is also
>>> available) consists of a master node and three data nodes all in separate
>>> pods. Thanks to that we can mimic more real-life scenarios on HDFS (3
>>> distributed nodes) and possibly run bigger tests so there's progress! :)
>>>
>>>
>> This is great. Also, looks like results are available in the test dashboard:
>> https://apache-beam-testing.appspot.com/explore?dashboard=5755685136498688
>> (BTW we should add information about the dashboard to the testing doc:
>> https://beam.apache.org/contribute/testing/)
>>
>> I'm currently working on proper documentation for this so that everyone
>>> can use it in IOITs (stay tuned).
>>>
>>> Regarding the above, I'd like to propose scaling up the
>>> Kubernetes cluster. AFAIK, currently, it consists of 1 node. If we scale it
>>> up to e.g. 3 nodes, the HDFS Kubernetes pods will distribute themselves on
>>> different machines rather than one, making it an even more "real-life"
>>> scenario (possibly more efficient?). Moreover, other Performance Tests
>>> (such as JDBC or mongo) could use more space for their infrastructure as
>>> well. Scaling up the cluster could also turn out to be useful for some future
>>> efforts, like BEAM-4508[1] (adapting and running some old IOITs on
>>> Jenkins).
>>>
>>> WDYT? Are there any objections?
>>>
>> +1 for increasing the size of Kubernetes cluster.
>>
>>>
>>> [1] https://issues.apache.org/jira/browse/BEAM-4508
>>>
>>> --
> Got feedback? go/pabloem-feedback
> 
>


Re: [Proposal] Apache Beam's Public Project Roadmap

2018-06-06 Thread Kenneth Knowles
This is great. I'm really excited to build a community process for this.

Kenn

On Wed, Jun 6, 2018 at 5:05 PM Griselda Cuevas  wrote:

> Hi Beam Community,
>
> I'd like to propose the creation of a Public Project Roadmap. Here are the
> details as well as some artifacts I started already.
>
> ---_---
>
> *Proposal *
>
>
> *What?*
>
> Create a simple spreadsheet-based project roadmap to generate an overview
> of all the efforts driven by different members of the community. I propose
> doing something like this tracker I put together [1].
>
>
> *Why?*
>
> *The ultimate purpose of this roadmap is to create a tool and method to
> make our project more transparent and efficient. It will help us map
> dependencies, identify collaboration areas and communicate future releases
> to users more clearly. *
>
>
> *How?*
>
> We could update the current status of the roadmap every three months
> (quarter), following these guidelines and timeline [2].
>
>
> Thanks!
> G
>
> [1]
> https://docs.google.com/spreadsheets/d/1W6xvPmGyG8Nd9R7wkwgwRJvZdyLoBg6F3NCrPafbKmk/edit#gid=0
>
>
> [2]
> https://docs.google.com/document/d/1jzASbp_5GrZdsF5YXau-MPh_v0jW6egJwL_Pyac5zeE/edit#heading=h.6dyyhz5krl9v
>
>
>


Re: [DISCUSS] [BEAM-4126] Deleting Maven build files (pom.xml) grace period?

2018-06-06 Thread Kenneth Knowles
+1

Definitely a good opportunity to decouple your build tools from your
dependencies' build tools.

On Wed, Jun 6, 2018 at 2:42 PM Ted Yu  wrote:

> +1 on this effort
>
>  Original message 
> From: Chamikara Jayalath 
> Date: 6/6/18 2:09 PM (GMT-08:00)
> To: dev@beam.apache.org, u...@beam.apache.org
> Subject: Re: [DISCUSS] [BEAM-4126] Deleting Maven build files (pom.xml)
> grace period?
>
> +1 for the overall effort. As Pablo mentioned, we need some time to
> migrate internal Dataflow build off of Maven build files. I created
> https://issues.apache.org/jira/browse/BEAM-4512 for this.
>
> Thanks,
> Cham
>
> On Wed, Jun 6, 2018 at 1:30 PM Eugene Kirpichov 
> wrote:
>
>> Is it possible for Dataflow to just keep a copy of the pom.xmls and
>> delete it as soon as Dataflow is migrated?
>>
>> Overall +1, I've been using Gradle without issues for a while and almost
>> forgot pom.xml's still existed.
>>
>> On Wed, Jun 6, 2018, 1:13 PM Pablo Estrada  wrote:
>>
>>> I agree that we should delete the pom.xml files soon, as they create a
>>> burden for maintainers.
>>>
>>> I'd like to be able to extend the grace period by a bit, to allow the
>>> internal build systems at Google to move away from using the Beam poms.
>>>
>>> We use these pom files to build Dataflow workers, and thus it's critical
>>> for us that they are available for a few more weeks while we set up a
>>> gradle build. Perhaps 4 weeks?
>>> (Calling out +Chamikara Jayalath, who has recently
>>> worked on internal Dataflow tooling.)
>>>
>>> Best
>>> -P.
>>>
>>> On Wed, Jun 6, 2018 at 1:05 PM Lukasz Cwik  wrote:
>>>
 Note: Apache Beam will still provide pom.xml for each release it
 produces. This is only about people using Maven to build Apache Beam
 themselves and not relying on the released artifacts in Maven Central.

 With the first release using Gradle as the build system underway, I
 wanted to start this thread to remind people that we are going to delete
 the Maven pom.xml files after the 2.5.0 release is finalized plus a two
 week grace period.

 Are there others who would like a shorter/longer grace period?

 The PR to delete the pom.xml is here:
 https://github.com/apache/beam/pull/5571

>>> --
>>> Got feedback? go/pabloem-feedback
>>> 
>>>
>>


Re: [VOTE] Policies for managing Beam dependencies

2018-06-06 Thread Kenneth Knowles
+0.5

I like the spirit of these policies. I think they need a little wording
work. Comments inline.

On Wed, Jun 6, 2018 at 4:53 PM, Chamikara Jayalath 
> wrote:
>>
>>
>> (1) Human readable reports on status of Beam dependencies are generated
>> weekly and shared with the Beam community through the dev list.
>>
>
Who is responsible for generating these? The mechanism or responsibility
should be made clear.

I clicked through a doc -> thread -> doc to find even some details. It
looks like manual run of a gradle command was adopted. So the
responsibility needs an owner, even if it is "unspecified volunteer on dev@
and feel free to complain or do it yourself if you don't see it"


(2) Beam components should define dependencies and their versions at the
>> top level.
>>
>
I think the big "should" works better with some guidance about when
something might be an exception, or at least explicit mention that there
can be rare exceptions. Unless you think that is never the case. If there
are no exceptions, then say "must" and if we hit a roadblock we can revisit
the policy.


(3) A significantly outdated dependency (identified manually or through
>> tooling) should result in a JIRA that is a blocker for the next release.
>> Release manager may choose to push the blocker to the subsequent release or
>> downgrade from a blocker.
>>
>
How is "significantly outdated" defined? By dev@ discussion? Seems like the
right way. Anyhow that's what will happen in practice as people debate the
blocker bug.


(4) Dependency declarations may identify owners that are responsible for
>> upgrading the respective dependencies.
>>
>> (5) Dependencies of Java SDK components that may cause issues to other
>> components if leaked should be shaded.
>>
>
We previously agreed upon our intent to migrate to "pre-shaded" aka
"vendored" packages:
https://lists.apache.org/thread.html/12383d2e5d70026427df43294e30d6524334e16f03d86c9a5860792f@%3Cdev.beam.apache.org%3E

With Maven, this involved a lot of boilerplate so I never did it. With
Gradle, we can easily build a re-usable rule to create such a package in a
couple of lines. I just opened the first WIP PR here:
https://github.com/apache/beam/pull/5570 it is blocked by deleting the poms
anyhow so by then we should have a configuration that works to vendor our
currently shaded artifacts.

So I think this should be rephrased to "should be vendored" so we don't
have to revise the policy.

Kenn



> Please vote:
>> [ ] +1, Approve that we adopt these policies
>> [ ] -1, Do not approve (please provide specific comments)
>>
>> Thanks,
>> Cham
>>
>> [1]
>> https://lists.apache.org/thread.html/8738c13ad7e576bc2fef158d2cc6f809e1c238ab8d5164c78484bf54@%3Cdev.beam.apache.org%3E
>> [2]
>> https://docs.google.com/document/d/15m1MziZ5TNd9rh_XN0YYBJfYkt0Oj-Ou9g0KFDPL2aA/edit?usp=sharing
>>
>
>


[Proposal] Apache Beam's Public Project Roadmap

2018-06-06 Thread Griselda Cuevas
Hi Beam Community,

I'd like to propose the creation of a Public Project Roadmap. Here are the
details as well as some artifacts I started already.

---_---

*Proposal *


*What?*

Create a simple spreadsheet-based project roadmap to generate an overview
of all the efforts driven by different members of the community. I propose
doing something like this tracker I put together [1].


*Why?*

*The ultimate purpose of this roadmap is to create a tool and method to
make our project more transparent and efficient. It will help us map
dependencies, identify collaboration areas and communicate future releases
to users more clearly. *


*How?*

We could update the current status of the roadmap every three months
(quarter), following these guidelines and timeline [2].


Thanks!
G

[1]
https://docs.google.com/spreadsheets/d/1W6xvPmGyG8Nd9R7wkwgwRJvZdyLoBg6F3NCrPafbKmk/edit#gid=0


[2]
https://docs.google.com/document/d/1jzASbp_5GrZdsF5YXau-MPh_v0jW6egJwL_Pyac5zeE/edit#heading=h.6dyyhz5krl9v


Re: [VOTE] Policies for managing Beam dependencies

2018-06-06 Thread Ahmet Altay
+1

Thank you for driving these decisions. I would make a meta-point: all other
recent votes, and this one if it passes, could be converted to web site
documents at some point in an easily accessible and linkable way.

On Wed, Jun 6, 2018 at 4:53 PM, Chamikara Jayalath 
wrote:

> Hi All,
>
> We recently had a discussion regarding managing Beam dependencies. Please
> see [1] for the email thread and [2] for the relevant document.
>
> This discussion resulted in the following policies. I believe these will
> help keep Beam in a healthy state while allowing human intervention when needed.
>
> (1) Human readable reports on status of Beam dependencies are generated
> weekly and shared with the Beam community through the dev list.
>
> (2) Beam components should define dependencies and their versions at the
> top level.
>
> (3) A significantly outdated dependency (identified manually or through
> tooling) should result in a JIRA that is a blocker for the next release.
> Release manager may choose to push the blocker to the subsequent release or
> downgrade from a blocker.
>
> (4) Dependency declarations may identify owners that are responsible for
> upgrading the respective dependencies.
>
> (5) Dependencies of Java SDK components that may cause issues to other
> components if leaked should be shaded.
>
>
> Please vote:
> [ ] +1, Approve that we adopt these policies
> [ ] -1, Do not approve (please provide specific comments)
>
> Thanks,
> Cham
>
> [1] https://lists.apache.org/thread.html/8738c13ad7e576bc2fef158d2cc6f809e1c238ab8d5164c78484bf54@%3Cdev.beam.apache.org%3E
> [2] https://docs.google.com/document/d/15m1MziZ5TNd9rh_XN0YYBJfYkt0Oj-Ou9g0KFDPL2aA/edit?usp=sharing
>


[VOTE] Policies for managing Beam dependencies

2018-06-06 Thread Chamikara Jayalath
Hi All,

We recently had a discussion regarding managing Beam dependencies. Please
see [1] for the email thread and [2] for the relevant document.

This discussion resulted in the following policies. I believe these will
help keep Beam in a healthy state while allowing human intervention when needed.

(1) Human readable reports on status of Beam dependencies are generated
weekly and shared with the Beam community through the dev list.

(2) Beam components should define dependencies and their versions at the
top level.

(3) A significantly outdated dependency (identified manually or through
tooling) should result in a JIRA that is a blocker for the next release.
Release manager may choose to push the blocker to the subsequent release or
downgrade from a blocker.

(4) Dependency declarations may identify owners that are responsible for
upgrading the respective dependencies.

(5) Dependencies of Java SDK components that may cause issues to other
components if leaked should be shaded.


Please vote:
[ ] +1, Approve that we adopt these policies
[ ] -1, Do not approve (please provide specific comments)

Thanks,
Cham

[1]
https://lists.apache.org/thread.html/8738c13ad7e576bc2fef158d2cc6f809e1c238ab8d5164c78484bf54@%3Cdev.beam.apache.org%3E
[2]
https://docs.google.com/document/d/15m1MziZ5TNd9rh_XN0YYBJfYkt0Oj-Ou9g0KFDPL2aA/edit?usp=sharing


Re: Proposal: keeping precommit times fast

2018-06-06 Thread Robert Bradshaw
Even if it's not perfect, seems like it'd surely be a net win (and probably
a large one). Also, the build cache should look back at more than just the
single previous build, so if any previous jobs (up to the cache size limit)
built/tested artifacts unchanged by the current PR, the results would live
in the cache.

I would look at (a) and (b) only if this isn't already good enough.

On Wed, Jun 6, 2018 at 3:50 PM Udi Meiri  wrote:

> To follow up on the Jenkins Job Cacher Plugin:
>
> Using a Jenkins plugin to save and reuse the Gradle cache for successive
> precommit jobs.
> The problem with this approach is that the precommit runs that a Jenkins
> server runs are unrelated.
> Say you have 2 PRs, A and B, and the precommit job for B reuses the cache
> left by the job for A.
> The diff between the two will cause tests affected both by A and B to be
> rerun (at least).
> If A modifies Python code, then the job for B must rerun ALL Python tests
> (since Gradle doesn't do dependency tracking for Python).
>
> Proposal:
> a. The cache plugin is still useful for successive Java precommit jobs,
> but not for Python. (Go, I have no idea)
> We could use it exclusively for Java precommits.
> b. To avoid running precommit jobs for code not touched by a PR, look at
> the paths of files changed.
> For example, a PR touching only files under sdks/python/... need only run
> Python precommit tests.
>
> On Tue, Jun 5, 2018 at 7:24 PM Udi Meiri  wrote:
>
>> I've been having a separate discussion on the proposal doc, which is
>> ready for another round of reviews.
>> Change summary:
>> - Changed the fast requirement to be < 30 minutes and simplified the check
>> as an aggregate for each precommit job type.
>> - Updated slowness notification methods to include automated methods: as
>> a precommit check result type on GitHub, as a bug.
>> - Merged in the metrics design doc.
>> - Added detailed design section.
>> - Added list of deliverables.
>>
>> What I would like is consensus regarding:
>> - How fast we want precommit runs to be. I propose 30m.
>> - Deadline for fixing a slow test before it is temporarily removed from
>> precommit. I propose 24 hours.
>>
>>
>> Replying to the thread:
>>
>> 1. I like the idea of using the Jenkins Job Cacher Plugin to skip
>> unaffected tests (BEAM-4400).
>>
>> 2. Java Precommit tests include integration tests (example).
>> We could split these out to get much faster results, i.e., a separate
>> precommit just for basic integration tests (which will still need to run in
>> <30m).
>> Perhaps lint checks for Python could be split out as well.
>>
>> I'll add these suggestions to the doc tomorrow.
>>
>> On Thu, May 24, 2018 at 9:25 AM Scott Wegner  wrote:
>>
>>> So, it sounds like there's agreement that we should improve precommit
>>> times by only running necessary tests, and configuring Jenkins Job
>>> Caching + Gradle build cache is a path to get there. I've filed BEAM-4400
>>> [1] to follow-up on this.
>>>
>>> Getting back to Udi's original proposal [2]: I see value in defining a
>>> metric and target for overall pre-commit timing. The proposal for an
>>> initial "2 hour" target is helpful as a guardrail: we're already hitting
>>> it, but if we drift to a point where we're not, that should trigger some
>>> action to be taken to get back to a healthy state.
>>>
>>> I wouldn't mind separately setting a more aspirational goal of getting the
>>> pre-commits even faster (i.e. 15-30 mins), but I suspect that would require
>>> a concerted effort to evaluate and improve existing tests across the
>>> codebase. One idea would be to ensure the metric reporting can show
>>> the trend, and which tests are responsible for the most walltime, so that
>>> we know where to invest any efforts to improve tests.
>>>
>>>
>>> [1] https://issues.apache.org/jira/browse/BEAM-4400
>>> [2]
>>> https://docs.google.com/document/d/1udtvggmS2LTMmdwjEtZCcUQy6aQAiYTI3OrTP8CLfJM/edit?usp=sharing
>>>
>>>
>>> On Wed, May 23, 2018 at 11:46 AM Kenneth Knowles  wrote:
>>>
 With regard to the Job Cacher Plugin: I think it is an infra ticket to
 install? And I guess we need it longer term when we move to containerized
 builds anyhow? One thing I've experienced with the Travis-CI cache is that
 the time spent uploading & downloading the remote cache - in that case of
 all the pip installed dependencies - negated the benefits. Probably for
 Beam it will have a greater benefit if we can skip most of the build.

 Regarding integration tests in precommit: I think it is OK to run maybe
 one Dataflow job in precommit, but it should be in parallel with the unit
 tests and just a smoke test that takes 5 minutes, not a suite that takes 35
 minutes. So IMO that is low-hanging fruit. If this would make postcommit
 unstable, then it also means precommit is unstable. Both are troublesome.

Re: Managing outdated dependencies

2018-06-06 Thread Chamikara Jayalath
Since there seems to be general agreement on these, I think we can start a
vote.

Possible post-vote tasks include the following.

* Generate human readable reports on status of Beam dependencies.
* Automatically create JIRAs for significantly outdated dependencies based
on above reports.
* Copy component level dependency version declarations to top level.
* Try to identify owners for dependencies and specify owners in comments
close to dependency declarations.
* Shade any dependencies that can cause issues if leaked to other
components (for example, gRPC).

Thanks,
Cham


On Tue, Jun 5, 2018 at 4:38 PM Chamikara Jayalath 
wrote:

> Thanks everybody for all the comments in the doc.
>
> Based on the reaction so far, it seems the community generally agrees with
> introducing policies around managing dependencies. I updated the suggested
> policies based on comments and rewrote them in a more concise form and
> added room for more human intervention when needed. Updated policies are
> given below (and mentioned in bold in the document). Please see the
> document for details.
>
> (1) Human readable reports on status of Beam dependencies are generated
> weekly and shared with the Beam community through the dev list.
>
> (2) Beam components should always have dependencies and their versions
> defined at the top level.
>
> (3) A significantly outdated dependency (identified manually or through
> tooling) should result in a JIRA that is a blocker for the next release.
> Release manager may choose to push the blocker to the subsequent release or
> downgrade from a blocker.
>
> (4) Dependency declarations may identify owners that are responsible for
> upgrading the respective dependencies.
>
> (5) Dependencies of Java SDK components that may cause issues to other
> components if leaked should be shaded.
>
> Thanks,
> Cham
>
> On Fri, Jun 1, 2018 at 2:44 PM Chamikara Jayalath 
> wrote:
>
>> Thanks. I left these ideas a bit detailed for clarification. I'll rewrite
>> them in a more concise form when we have enough feedback.
>>
>> - Cham
>>
>> On Fri, Jun 1, 2018 at 1:58 PM Kenneth Knowles  wrote:
>>
>>> This does seem really useful. I appreciate the detailed explanations. If
>>> we formalize it into policy, I'd love to make it a bit more concise, and
>>> with appropriate room for human contestation of the guidelines.
>>>
>>> On Fri, Jun 1, 2018 at 1:47 PM Scott Wegner  wrote:
>>>
 Thanks Cham. Overall this seems like a useful hygiene improvement for
 the project. I've left some comments in the doc.

 On Fri, Jun 1, 2018 at 10:48 AM Chamikara Jayalath <
 chamik...@google.com> wrote:

> Hi All,
>
> I've copied ideas proposed in the doc below for more visibility. Any
> comments are welcome.
>
>
>
> - Human readable per-SDK reports on the status of Beam dependencies are
> generated weekly and shared with the Beam community through the dev list.
> These reports should be concise and should highlight the cases the
> community has to act on. See [4] for more details on this.
>
> - Beam components (IO connectors, runners, etc.) should always try to use
> versions of dependencies that are defined at the top level. Per-component
> dependency version overrides should only be performed in rare cases and
> should come with clear warnings for users.
>
> - Upgrading a dependency with an outdated major version becomes a blocker
> for the next major version release of Beam and for any minor version
> releases after the next immediate minor version release. For example, if a
> dependency is identified to be outdated while the latest release is x.y.z,
> upgrading this dependency becomes a blocker for releases (x+1).0.0 and
> x.(y+2).0 of Beam. Additionally, upgrading to a major version of a
> dependency will only be enforced if the new major version of the
> dependency can be adopted without a significant rewrite to any Beam
> component. Note that this policy intentionally allows one of the minor
> version releases to proceed without upgrading the dependency, which I
> believe will give the Beam community enough breathing room to upgrade
> dependencies without significantly affecting the release frequency.
>
> - Upgrading a dependency with a significantly outdated minor version
> (based on the methodology defined in [4]) becomes a blocker for the next
> major version release of Beam and for any minor version releases of Beam
> after the next immediate minor version release. Note that this policy does
> not force Beam to adopt every minor version release of a dependency.
>
> - When performing a release, the release manager should make sure that
> blockers identified through the above process are resolved before the
> release candidate is cut.
>
> - Optionally, dependency declarations may have comments that identify
> owners that should be responsible for upgrading the respective
> dependencies. Release manager may c

Re: Proposal: keeping precommit times fast

2018-06-06 Thread Udi Meiri
To follow up on the Jenkins Job Cacher Plugin:

Using a Jenkins plugin to save and reuse the Gradle cache for successive
precommit jobs.
The problem with this approach is that the precommit runs that a Jenkins
server runs are unrelated.
Say you have 2 PRs, A and B, and the precommit job for B reuses the cache
left by the job for A.
The diff between the two will cause tests affected both by A and B to be
rerun (at least).
If A modifies Python code, then the job for B must rerun ALL Python tests
(since Gradle doesn't do dependency tracking for Python).

Proposal:
a. The cache plugin is still useful for successive Java precommit jobs, but
not for Python. (Go, I have no idea)
We could use it exclusively for Java precommits.
b. To avoid running precommit jobs for code not touched by a PR, look at
the paths of files changed.
For example, a PR touching only files under sdks/python/... need only run
Python precommit tests.
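
A sketch of the path-based selection in (b) -- the path prefixes and suite
names are assumptions, not an existing Beam script:

import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: map changed file paths to the precommit suites to run.
class PrecommitSelector {
  static Set<String> suitesFor(List<String> changedPaths) {
    Set<String> suites = new TreeSet<>();
    for (String path : changedPaths) {
      if (path.startsWith("sdks/python/")) {
        suites.add("Python");
      } else if (path.startsWith("sdks/go/")) {
        suites.add("Go");
      } else {
        suites.add("Java"); // conservatively run Java for anything else
      }
    }
    return suites;
  }
}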

On Tue, Jun 5, 2018 at 7:24 PM Udi Meiri  wrote:

> I've been having a separate discussion on the proposal doc, which is ready
> for another round of reviews.
> Change summary:
> - Changed the fast requirement to be < 30 minutes and simplified the check
> as an aggregate for each precommit job type.
> - Updated slowness notification methods to include automated methods: as a
> precommit check result type on GitHub, as a bug.
> - Merged in the metrics design doc.
> - Added detailed design section.
> - Added list of deliverables.
>
> What I would like is consensus regarding:
> - How fast we want precommit runs to be. I propose 30m.
> - Deadline for fixing a slow test before it is temporarily removed from
> precommit. I propose 24 hours.
>
>
> Replying to the thread:
>
> 1. I like the idea of using the Jenkins Job Cacher Plugin to skip
> unaffected tests (BEAM-4400).
>
> 2. Java Precommit tests include integration tests (example).
> We could split these out to get much faster results, i.e., a separate
> precommit just for basic integration tests (which will still need to run in
> <30m).
> Perhaps lint checks for Python could be split out as well.
>
> I'll add these suggestions to the doc tomorrow.
>
> On Thu, May 24, 2018 at 9:25 AM Scott Wegner  wrote:
>
>> So, it sounds like there's agreement that we should improve precommit
>> times by only running necessary tests, and configuring Jenkins Job
>> Caching + Gradle build cache is a path to get there. I've filed BEAM-4400
>> [1] to follow-up on this.
>>
>> Getting back to Udi's original proposal [2]: I see value in defining a
>> metric and target for overall pre-commit timing. The proposal for an
>> initial "2 hour" target is helpful as a guardrail: we're already hitting
>> it, but if we drift to a point where we're not, that should trigger some
>> action to be taken to get back to a healthy state.
>>
>> I wouldn't mind separately setting a more aspirational goal of getting the
>> pre-commits even faster (i.e. 15-30 mins), but I suspect that would require
>> a concerted effort to evaluate and improve existing tests across the
>> codebase. One idea would be to ensure the metric reporting can show
>> the trend, and which tests are responsible for the most walltime, so that
>> we know where to invest any efforts to improve tests.
>>
>>
>> [1] https://issues.apache.org/jira/browse/BEAM-4400
>> [2]
>> https://docs.google.com/document/d/1udtvggmS2LTMmdwjEtZCcUQy6aQAiYTI3OrTP8CLfJM/edit?usp=sharing
>>
>>
>> On Wed, May 23, 2018 at 11:46 AM Kenneth Knowles  wrote:
>>
>>> With regard to the Job Cacher Plugin: I think it is an infra ticket to
>>> install? And I guess we need it longer term when we move to containerized
>>> builds anyhow? One thing I've experienced with the Travis-CI cache is that
>>> the time spent uploading & downloading the remote cache - in that case of
>>> all the pip installed dependencies - negated the benefits. Probably for
>>> Beam it will have a greater benefit if we can skip most of the build.
>>>
>>> Regarding integration tests in precommit: I think it is OK to run maybe
>>> one Dataflow job in precommit, but it should be in parallel with the unit
>>> tests and just a smoke test that takes 5 minutes, not a suite that takes 35
>>> minutes. So IMO that is low-hanging fruit. If this would make postcommit
>>> unstable, then it also means precommit is unstable. Both are troublesome.
>>>
>>> More short term, some possible hacks:
>>>
>>>  - Point gradle to cache outside the git workspace. We already did this
>>> for .m2 and it helped a lot.
>>>  - Intersect touched files with projects. Our nonstandard project names
>>> might be a pain here. Not sure if fixing that is on the roadmap.
>>>
>>> Kenn
>>>
>>> On Wed, May 23, 2018 at 9:31 AM Ismaël Mejía  wrote:
>>>
 I second Robert's idea of ‘intelligently’ running only the affected tests,
 probably
 there is no need to run Java for a go fix (a

Re: SDK Harness Deployment

2018-06-06 Thread Henning Rohde
Thanks Thomas. The id provided to the SDK harness must be sent as a gRPC
header when it connects to the TM. The TM can use a fixed port and
multiplex requests based on that id - to match the SDK harness with the
appropriate job/slot/whatnot. The relationship between SDK harness and TM
is not limited to 1:1, but rather many:1. We'll likely need that for
cross-language as well. Wouldn't multiplexing on a single port for the
control plane be the easiest solution for both #1 and #2? The data plane
can still use various dynamically-allocated ports.
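
For illustration, attaching such an id with gRPC in Java could look like
the sketch below (the header name "worker_id" and the endpoint are
assumptions for the example):

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.Metadata;
import io.grpc.stub.MetadataUtils;

// Sketch: every call on this channel carries the harness id as a header,
// so one TM port can multiplex control connections from many SDK harnesses.
class WorkerIdChannel {
  static ManagedChannel create(String endpoint, String workerId) {
    Metadata headers = new Metadata();
    headers.put(Metadata.Key.of("worker_id", Metadata.ASCII_STRING_MARSHALLER), workerId);
    return ManagedChannelBuilder.forTarget(endpoint)
        .usePlaintext()
        .intercept(MetadataUtils.newAttachHeadersInterceptor(headers))
        .build();
  }
}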

On Kubernetes, we're somewhat constrained by the pod lifetime and multi-job
TMs might not be as natural to achieve.

Thanks,
 Henning

On Wed, Jun 6, 2018 at 2:28 PM Thomas Weise  wrote:

> Hi Henning,
>
> Here is a page that explains the scheduling and overall functioning of the
> task manager in Flink:
>
>
> https://ci.apache.org/projects/flink/flink-docs-release-1.5/internals/job_scheduling.html
>
> Here are the 2 issues:
>
> #1 each task manager process gets assigned multiple units of execution into
> task slots. So when we deploy a Beam pipeline, we can end up with multiple
> executable stages running in a single TM JVM.
>
> This is where a 1-to-1 relationship between TM and SDK harness can lead to a
> bottleneck (all task slots of a single TM push their work to a single SDK
> container).
>
> #2 in a deployment where multiple pipelines share a Flink cluster, the SDK
> harness per TM approach wouldn't work logically. We would need to have
> multiple SDK containers, not just for efficiency reasons.
>
> This would not be an issue for the deployment scenario I'm looking at, but
> it needs to be considered for general Flink runner deployment.
>
> Regarding the assignment of fixed endpoints within the TM, that is
> possible but it doesn't address #1 and #2.
>
> I hope this clarifies?
>
> Thanks,
> Thomas
>
>
> On Wed, Jun 6, 2018 at 12:31 PM, Henning Rohde  wrote:
>
>> Thanks for writing down and explaining the problem, Thomas. Let me try to
>> tease some of the topics apart.
>>
>> First, the basic setup is currently as follows: there are 2 worker
>> processes (A) "SDK harness" and (B) "Runner harness" that need to
>> communicate. A connects to B. The fundamental endpoint(s) of B as well as
>> an id -- logging, provisioning, artifacts and control -- are provided to A
>> via command line parameters. A is not expected to be able to connect to the
>> control port without first obtaining pipeline options (from provisioning)
>> and staged files (from artifacts). As an aside, this is where the separate
>> boot.go code comes in handy. A can assume it will be restarted, if it
>> exits. A does not assume the given endpoints are up when started and should
>> make blocking calls with timeout (but if they are not and it exits, it is restarted
>> anyway and will retry). Note that the data plane endpoints are part of the
>> control instructions and need not be known or allocated at startup or even
>> be served by the same TM.
>>
>> Second, whether or not docker is used is rather an implementation detail,
>> but if we use Kubernetes (or other such options) then some constraints come
>> into play.
>>
>> Either way, two scenarios work well:
>>(1) B starts A: The ULR and Flink prototype does this. B will delay
>> starting A until it has decided which endpoints to use. This approach
>> requires B to do process/container management, which we'd rather not have
>> to do at scale. But it's convenient for local runners.
>>(2) B has its (local) endpoints configured or fixed: A and B can be
>> started concurrently. Dataflow does this. Kubernetes lends itself well to
>> this approach (and handles container management for us).
>>
>> The Flink on Kubernetes scenario described above doesn't:
>>(3) B must use randomized (local) endpoints _and_ A and B are started
>> concurrently: A would not know where to connect.
>>
>> Perhaps I'm not understanding the constraints of the TM well enough, but
>> can we really not open a configured/fixed port from the TM -- especially in
>> a network-isolated Kubernetes pod? Adding a third process (C) "proxy" to
>> the pod might by an alternative option and morph (3) into (2). B would
>> configure C when it's ready. A would connect to C, but be blocked until B
>> has configured it. C could perhaps even serve logging, provisioning, and
>> artifacts without B. And the data plane would not go over C anyway. If
>> control proxy'ing is a concern, then alternatively we would add an
>> indirection to the container contract and provide the control endpoint in
>> the provisioning api, say, or even a new discovery service.
>>
>> There are of course other options and tradeoffs, but having Flink work on
>> Kubernetes and not go against the grain seems desirable to me.
>>
>> Thanks,
>>  Henning
>>
>>
>> On Wed, Jun 6, 2018 at 10:11 AM Thomas Weise  wrote:
>>
>>> Hi,
>>>
>>> The current plan for running the SDK harness is to execute docker to
>>> launch SDK containers with service endpoints provided by the runner in the
>>> docker command line.

Re: [VOTE] Apache Beam, version 2.5.0, release candidate #1

2018-06-06 Thread Lukasz Cwik
I have added the 2.5.0 tab to the validation spreadsheet[1], please mark
down which things you intend to validate for the release and update the
community on progress.

1:
https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=152451807

On Wed, Jun 6, 2018 at 1:51 PM Reuven Lax  wrote:

> Agreed 💯! It's not easy being the first to try something. Thank you so
> much for helping blaze the way!
>
> Reuven
>
> On Wed, Jun 6, 2018, 11:50 AM Etienne Chauchot 
> wrote:
>
>> Thanks JB for all your work! I believe doing the first Gradle release
>> must have been hard.
>> I'll run Nexmark on the release and keep you posted.
>>
>> Best
>> Etienne
>>
>>
>> On Wednesday, June 6, 2018 at 10:44 +0200, Jean-Baptiste Onofré wrote:
>>
>> Hi everyone,
>>
>> Please review and vote on the release candidate #1 for the version
>> 2.5.0, as follows:
>>
>> [ ] +1, Approve the release
>> [ ] -1, Do not approve the release (please provide specific comments)
>>
>> NB: this is the first release using Gradle, so don't be too harsh ;) A
>> PR about the release guide will follow thanks to this release.
>>
>> The complete staging area is available for your review, which includes:
>> * JIRA release notes [1],
>> * the official Apache source release to be deployed to dist.apache.org
>> [2], which is signed with the key with fingerprint C8282E76 [3],
>> * all artifacts to be deployed to the Maven Central Repository [4],
>> * source code tag "v2.5.0-RC1" [5],
>> * website pull request listing the release and publishing the API
>> reference manual [6].
>> * Java artifacts were built with Gradle 4.7 (wrapper) and OpenJDK/Oracle
>> JDK 1.8.0_172 (Oracle Corporation 25.172-b11).
>> * Python artifacts are deployed along with the source release to the
>> dist.apache.org [2].
>>
>> The vote will be open for at least 72 hours. It is adopted by majority
>> approval, with at least 3 PMC affirmative votes.
>>
>> Thanks,
>> JB
>>
>> [1]
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12342847
>> [2] https://dist.apache.org/repos/dist/dev/beam/2.5.0/
>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>> [4] https://repository.apache.org/content/repositories/orgapachebeam-1041/
>> [5] https://github.com/apache/beam/tree/v2.5.0-RC1
>> [6] https://github.com/apache/beam-site/pull/463


Re: [DISCUSS] [BEAM-4126] Deleting Maven build files (pom.xml) grace period?

2018-06-06 Thread Ted Yu
+1 on this effort
 Original message 
From: Chamikara Jayalath
Date: 6/6/18 2:09 PM (GMT-08:00)
To: dev@beam.apache.org, u...@beam.apache.org
Subject: Re: [DISCUSS] [BEAM-4126] Deleting Maven build files (pom.xml)
grace period?

+1 for the overall effort. As Pablo mentioned, we need some time to migrate
internal Dataflow build off of Maven build files. I created
https://issues.apache.org/jira/browse/BEAM-4512 for this.

Thanks,
Cham

On Wed, Jun 6, 2018 at 1:30 PM Eugene Kirpichov  wrote:

> Is it possible for Dataflow to just keep a copy of the pom.xmls and delete
> it as soon as Dataflow is migrated?
>
> Overall +1, I've been using Gradle without issues for a while and almost
> forgot pom.xml's still existed.
>
> On Wed, Jun 6, 2018, 1:13 PM Pablo Estrada  wrote:
>
>> I agree that we should delete the pom.xml files soon, as they create a
>> burden for maintainers.
>>
>> I'd like to be able to extend the grace period by a bit, to allow the
>> internal build systems at Google to move away from using the Beam poms.
>>
>> We use these pom files to build Dataflow workers, and thus it's critical
>> for us that they are available for a few more weeks while we set up a
>> gradle build. Perhaps 4 weeks?
>> (Calling out +Chamikara Jayalath, who has recently worked on internal
>> Dataflow tooling.)
>>
>> Best
>> -P.
>>
>> On Wed, Jun 6, 2018 at 1:05 PM Lukasz Cwik  wrote:
>>
>>> Note: Apache Beam will still provide pom.xml for each release it
>>> produces. This is only about people using Maven to build Apache Beam
>>> themselves and not relying on the released artifacts in Maven Central.
>>>
>>> With the first release using Gradle as the build system underway, I
>>> wanted to start this thread to remind people that we are going to delete
>>> the Maven pom.xml files after the 2.5.0 release is finalized plus a two
>>> week grace period.
>>>
>>> Are there others who would like a shorter/longer grace period?
>>>
>>> The PR to delete the pom.xml is here:
>>> https://github.com/apache/beam/pull/5571
>>>
>> --
>> Got feedback? go/pabloem-feedback




Re: Beam breaks when it isn't loaded via the Thread Context Class Loader

2018-06-06 Thread Lukasz Cwik
Romain, can you point to an example of a global singleton registry that
does this right for class loading (it may allow people to work towards such
an effort)?

On Tue, Jun 5, 2018 at 10:06 PM Romain Manni-Bucau 
wrote:

> It is actually very localised in runner code where beam should reset the
> classloader when the deserialization happens and then the runner owns the
> classloader all the way in evaluators.
>
> If IOs change the classloader they must indeed handle it too, and patch
> the deserialization as well.
>
> Here again (we mentioned it multiple times in other threads) Beam misses
> a global singleton registry where you can register a "service" to look it
> up based on a serialization configuration, with a lifecycle allowing the
> classloader to be closed in all instances without hacks.
>
>
> On Tue, Jun 5, 2018 at 23:50, Kenneth Knowles  wrote:
>
>> Perhaps we can also adopt a practice of making our own APIs explicitly
>> pass a Classloader when appropriate so we only have to set this when we are
>> entering code that does not have good hygiene. It might actually be nice to
>> have a lightweight static analysis to forbid bad methods in our code.
>>
>> Kenn
>>
>> On Mon, Jun 4, 2018 at 3:43 PM Lukasz Cwik  wrote:
>>
>>> I totally agree, but there are so many Java APIs (including ours) that
>>> messed this up so everyone lives with the same hack.
>>>
>>> On Mon, Jun 4, 2018 at 3:41 PM Andrew Pilloud 
>>> wrote:
>>>
 It seems like a terribly fragile way to pass arguments but my tests
 pass when I wrap the JDBC path into Beam pipeline execution with that
 pattern.

 Thanks!

 Andrew

 On Mon, Jun 4, 2018 at 3:20 PM Lukasz Cwik  wrote:

> It is a common mistake for APIs to not include a way to specify which
> class loader to use when doing something like deserializing an instance of
> a class via the ObjectInputStream. This common issue also affects Apache
> Beam (SerializableCoder, PipelineOptionsFactory, ...) and the way that
> typical Java APIs have gotten around this is to use the thread context
> class loader (TCCL) as the way to plumb this additional attribute through.
> So Apache Beam is meant to honor the TCCL in all places if it has been set,
> as most (though not all) Java libraries do the same hack.
>
> In most environments the TCCL is not set and we are working with a
> single class loader. It turns out that in more complicated environments
> (like when loading a JDBC driver, or JNDI, or an application server, ...)
> this usually doesn't work without each caller knowing what class loading
> context they should be in. A common workaround for most scenarios is to
> always set the TCCL to the current class's class loader, like so, before
> invoking any APIs that do class loading, so you don't propagate the
> caller's TCCL along, since they may have set it for some other reason:
>
> ClassLoader originalClassLoader = Thread.currentThread().getContextClassLoader();
> try {
>   Thread.currentThread().setContextClassLoader(getClass().getClassLoader());
>   // call some API that uses reflection without taking a ClassLoader param
> } finally {
>   Thread.currentThread().setContextClassLoader(originalClassLoader);
> }
>
>
>
> On Mon, Jun 4, 2018 at 1:57 PM Andrew Pilloud 
> wrote:
>
>> I'm having class loading issues that go away when I revert the
>> changes in our use of Class.forName added in
>> https://github.com/apache/beam/pull/4674. The problem I'm having is
>> that the typical JDBC GUI (SqlWorkbench/J, SQuirreL SQL) creates an
>> isolated class loader to load our library. Things work if we call
>> Class.forName with the default class loader [getClass().getClassLoader() or
>> no argument] but not if we use the thread context class loader
>> [Thread.currentThread().getContextClassLoader() or
>> ReflectHelpers.findClassLoader()]. Why is using the default class loader
>> not the right thing to do? How can I fix this problem?
>>
>> See this integration test for an example:
>> https://github.com/apilloud/beam/blob/directrunner/sdks/java/extensions/sql/jdbc/src/test/java/org/apache/beam/sdk/extensions/sql/jdbc/JdbcIT.java#L44
>>
>> https://scans.gradle.com/s/iquqinhns2ymi/tests/slmg6ytuuqlus-akh5xpgshj32k
>>
>> Andrew
>>
>


Re: SDK Harness Deployment

2018-06-06 Thread Thomas Weise
Hi Henning,

Here is a page that explains the scheduling and overall functioning of the
task manager in Flink:

https://ci.apache.org/projects/flink/flink-docs-release-1.5/internals/job_scheduling.html

Here are the 2 issues:

#1 each task manager process gets assigned multiple units of execution into
task slots. So when we deploy a Beam pipeline, we can end up with multiple
executable stages running in a single TM JVM.

This is where a 1-to-1 relationship between TM and SDK harness can lead to a
bottleneck (all task slots of a single TM push their work to a single SDK
container).

#2 in a deployment where multiple pipelines share a Flink cluster, the SDK
harness per TM approach wouldn't work logically. We would need to have
multiple SDK containers, not just for efficiency reasons.

This would not be an issue for the deployment scenario I'm looking at, but
it needs to be considered for general Flink runner deployment.

Regarding the assignment of fixed endpoints within the TM, that is possible
but it doesn't address #1 and #2.

I hope this clarifies?

Thanks,
Thomas


On Wed, Jun 6, 2018 at 12:31 PM, Henning Rohde  wrote:

> Thanks for writing down and explaining the problem, Thomas. Let me try to
> tease some of the topics apart.
>
> First, the basic setup is currently as follows: there are 2 worker
> processes (A) "SDK harness" and (B) "Runner harness" that need to
> communicate. A connects to B. The fundamental endpoint(s) of B as well as
> an id -- logging, provisioning, artifacts and control -- are provided to A
> via command line parameters. A is not expected to be able to connect to the
> control port without first obtaining pipeline options (from provisioning)
> and staged files (from artifacts). As an aside, this is where the separate
> boot.go code comes in handy. A can assume it will be restarted, if it
> exits. A does not assume the given endpoints are up when started and should
> make blocking calls with timeout (but if they are not and it exits, it is restarted
> anyway and will retry). Note that the data plane endpoints are part of the
> control instructions and need not be known or allocated at startup or even
> be served by the same TM.
>
> Second, whether or not docker is used is rather an implementation detail,
> but if we use Kubernetes (or other such options) then some constraints come
> into play.
>
> Either way, two scenarios work well:
>(1) B starts A: The ULR and Flink prototype does this. B will delay
> starting A until it has decided which endpoints to use. This approach
> requires B to do process/container management, which we'd rather not have
> to do at scale. But it's convenient for local runners.
>(2) B has its (local) endpoints configured or fixed: A and B can be
> started concurrently. Dataflow does this. Kubernetes lends itself well to
> this approach (and handles container management for us).
>
> The Flink on Kubernetes scenario described above doesn't:
>(3) B must use randomized (local) endpoints _and_ A and B are started
> concurrently: A would not know where to connect.
>
> Perhaps I'm not understanding the constraints of the TM well enough, but
> can we really not open a configured/fixed port from the TM -- especially in
> a network-isolated Kubernetes pod? Adding a third process (C) "proxy" to
> the pod might by an alternative option and morph (3) into (2). B would
> configure C when it's ready. A would connect to C, but be blocked until B
> has configured it. C could perhaps even serve logging, provisioning, and
> artifacts without B. And the data plane would not go over C anyway. If
> control proxy'ing is a concern, then alternatively we would add an
> indirection to the container contract and provide the control endpoint in
> the provisioning api, say, or even a new discovery service.
>
> There are of course other options and tradeoffs, but having Flink work on
> Kubernetes and not go against the grain seems desirable to me.
>
> Thanks,
>  Henning
>
>
> On Wed, Jun 6, 2018 at 10:11 AM Thomas Weise  wrote:
>
>> Hi,
>>
>> The current plan for running the SDK harness is to execute docker to
>> launch SDK containers with service endpoints provided by the runner in the
>> docker command line.
>>
>> In the case of Flink runner (prototype), the service endpoints are
>> dynamically allocated per executable stage. There is typically one Flink
>> task manager running per machine. Each TM has multiple task slots. A subset
>> of these task slots will run the Beam executable stages. Flink allows
>> multiple jobs in one TM, so we could have executable stages of different
>> pipelines running in a single TM, depending on how users deploy. The
>> prototype also has no cleanup for the SDK containers; they remain running
>> and orphaned once the runner is gone.
>>
>> I'm trying to find out how this approach can be augmented for deployment
>> on Kubernetes. Our deployments won't allow multiple jobs per task manager,
>> so all task slots will belong to the same pipeline context. The intent is
>> to deploy SDK harness containers along with TMs in the same pod.

Re: [DISCUSS] [BEAM-4126] Deleting Maven build files (pom.xml) grace period?

2018-06-06 Thread Chamikara Jayalath
+1 for the overall effort. As Pablo mentioned, we need some time to migrate
the internal Dataflow build off of the Maven build files. I created
https://issues.apache.org/jira/browse/BEAM-4512 for this.

Thanks,
Cham

On Wed, Jun 6, 2018 at 1:30 PM Eugene Kirpichov 
wrote:

> Is it possible for Dataflow to just keep a copy of the pom.xmls and delete
> it as soon as Dataflow is migrated?
>
> Overall +1, I've been using Gradle without issues for a while and almost
> forgot pom.xmls still existed.
>
> On Wed, Jun 6, 2018, 1:13 PM Pablo Estrada  wrote:
>
>> I agree that we should delete the pom.xml files soon, as they create a
>> burden for maintainers.
>>
>> I'd like to be able to extend the grace period by a bit, to allow the
>> internal build systems at Google to move away from using the Beam poms.
>>
>> We use these pom files to build Dataflow workers, and thus it's critical
>> for us that they are available for a few more weeks while we set up a
>> gradle build. Perhaps 4 weeks?
>> (Calling out+Chamikara Jayalath  who has recently
>> worked on internal Dataflow tooling.)
>>
>> Best
>> -P.
>>
>> On Wed, Jun 6, 2018 at 1:05 PM Lukasz Cwik  wrote:
>>
>>> Note: Apache Beam will still provide pom.xml for each release it
>>> produces. This is only about people using Maven to build Apache Beam
>>> themselves and not relying on the released artifacts in Maven Central.
>>>
>>> With the first release using Gradle as the build system is underway, I
>>> wanted to start this thread to remind people that we are going to delete
>>> the Maven pom.xml files after the 2.5.0 release is finalized plus a two
>>> week grace period.
>>>
>>> Are there others who would like a shorter/longer grace period?
>>>
>>> The PR to delete the pom.xml is here:
>>> https://github.com/apache/beam/pull/5571
>>>
>> --
>> Got feedback? go/pabloem-feedback
>> 
>>
>


Re: [VOTE] Apache Beam, version 2.5.0, release candidate #1

2018-06-06 Thread Reuven Lax
Agreed 💯! It's not easy being the first to try something. Thank you so
much for helping blaze the way!

Reuven

On Wed, Jun 6, 2018, 11:50 AM Etienne Chauchot  wrote:

> Thanks JB for all your work ! I believe doing the first gradle release
> must have been hard.
> I'll run Nexmark on the release and keep you posted.
>
> Best
> Etienne
>
>
> Le mercredi 06 juin 2018 à 10:44 +0200, Jean-Baptiste Onofré a écrit :
>
> Hi everyone,
>
>
> Please review and vote on the release candidate #1 for the version
>
> 2.5.0, as follows:
>
>
> [ ] +1, Approve the release
>
> [ ] -1, Do not approve the release (please provide specific comments)
>
>
> NB: this is the first release using Gradle, so don't be too harsh ;) A
>
> PR about the release guide will follow thanks to this release.
>
>
> The complete staging area is available for your review, which includes:
>
> * JIRA release notes [1],
>
> * the official Apache source release to be deployed to dist.apache.org
>
> [2], which is signed with the key with fingerprint C8282E76 [3],
>
> * all artifacts to be deployed to the Maven Central Repository [4],
>
> * source code tag "v2.5.0-RC1" [5],
>
> * website pull request listing the release and publishing the API
>
> reference manual [6].
>
> * Java artifacts were built with Gradle 4.7 (wrapper) and OpenJDK/Oracle
>
> JDK 1.8.0_172 (Oracle Corporation 25.172-b11).
>
> * Python artifacts are deployed along with the source release to the
>
> dist.apache.org [2].
>
>
> The vote will be open for at least 72 hours. It is adopted by majority
>
> approval, with at least 3 PMC affirmative votes.
>
>
> Thanks,
>
> JB
>
>
> [1]
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12342847
>
> [2] https://dist.apache.org/repos/dist/dev/beam/2.5.0/
>
> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>
> [4] https://repository.apache.org/content/repositories/orgapachebeam-1041/
>
> [5] https://github.com/apache/beam/tree/v2.5.0-RC1
>
> [6] https://github.com/apache/beam-site/pull/463
>
>
>
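
For anyone who wants to verify the candidate, the checks typically look
something like the following (the artifact file name is an assumption
based on the staging layout in [2]):

# Fetch the staged source release, its signature, and the KEYS file.
wget https://dist.apache.org/repos/dist/dev/beam/2.5.0/apache-beam-2.5.0-source-release.zip
wget https://dist.apache.org/repos/dist/dev/beam/2.5.0/apache-beam-2.5.0-source-release.zip.asc
wget https://dist.apache.org/repos/dist/release/beam/KEYS

# Import the release keys and check the signature (fingerprint C8282E76).
gpg --import KEYS
gpg --verify apache-beam-2.5.0-source-release.zip.asc apache-beam-2.5.0-source-release.zip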


Re: Existing transactionality inconsistency in the Beam Java State API

2018-06-06 Thread Lukasz Cwik
Sounds great and thanks for the conclusion summary.

On Tue, Jun 5, 2018 at 4:56 PM Charles Chen  wrote:

> Thanks everyone for commenting and contributing to the discussion.  There
> appears to be enough consensus on these points to start an initial
> implementation.  Specifically, I'd like to highlight from the doc (
> https://docs.google.com/document/d/1GadEkAmtbJQjmqiqfSzGw3b66TKerm8tyn6TK4blAys/edit#heading=h.ofyl9jspiz3b
> ):
>
> *With respect to existing state data and transactionality: *We will go
> with presenting a consistent view of state data, in that after a write, a
> read will always return a value that reflects the result of that write.
> When a read returning an iterator object is done, an implicit snapshot of
> the underlying value at that point will be taken, and any subsequent
> mutation of the state will not change the values contained in, or
> invalidate, a previously returned iterator.  We will use state.prefetch()
> and state.prefetch_key(key) to suggest that the runner prefetch relevant
> data; this will supersede and replace the existing state.readLater()
> methods from the Java API, since the semantics of saying "prefetch" are
> much clearer to the user.
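
A minimal sketch of these semantics against the Java SDK's existing state
API (BagState, StateSpecs, and @StateId are the existing API; the snapshot
behavior described in the comments is the proposed semantics, and the
prefetch methods are not yet part of the API):

import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.state.BagState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

class SnapshotSemanticsFn extends DoFn<KV<String, String>, String> {

  @StateId("buffer")
  private final StateSpec<BagState<String>> bufferSpec =
      StateSpecs.bag(StringUtf8Coder.of());

  @ProcessElement
  public void process(ProcessContext c,
      @StateId("buffer") BagState<String> buffer) {
    Iterable<String> snapshot = buffer.read(); // implicit snapshot taken here
    buffer.add(c.element().getValue());        // a later write...
    for (String v : snapshot) {                // ...is not visible in the
      c.output(v);                             // previously returned iterable
    }
  }
}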
>
> *With respect to state for merging windows: *We will first implement proposed
> design 3
> 
> by excluding an implementation of non-combinable state types like the
> non-combinable ValueState, since this is the most forward-compatible
> option.  At a later point, we will either obtain community consensus to
> remove non-combinable ValueState from the Java SDK as well, or turn to 
> proposed
> design 1
> 
> by implementing non-combinable state types in the Python SDK and reject
> jobs that use non-combinable state types with merging windows.
>
> Best,
> Charles
>
> On Fri, May 25, 2018 at 10:56 AM Lukasz Cwik  wrote:
>
>> Great, I was confused by the description that was provided and then by the
>> follow-up by Ben. I think it's worthwhile to describe the differences with
>> actual examples of what happens.
>>
>> On Fri, May 25, 2018 at 10:54 AM Kenneth Knowles  wrote:
>>
>>> I think the return value of read() should always be an immutable value.
>>>
>>> Kenn
>>>
>>> On Fri, May 25, 2018 at 10:44 AM Lukasz Cwik  wrote:
>>>
 Kenn, in the second example where we are creating views whenever read()
 is called, is it that the view's underlying data is immutable? For example:
 Iterable<String> values = state.read();
 state.append("newValue");
 If I iterate over values, does values now contain "newValue"?


 On Thu, May 24, 2018 at 10:38 AM Kenneth Knowles 
 wrote:

> I see what you mean but I don't agree that futures imply anything
> other than "it is a value that you have to force", with deliberately many
> possible implementations. When at the point in 1 and you've got
>
> interface ReadableState<T> {
>   T read();
> }
>
> and you want to improve performance, both approaches "void
> readLater()" and "StateFuture read()" are natural evolutions. They both
> gain the same 10x and they both support the "unchanging committed state
> plus buffered mutations" implementation well. And snapshots are 
> essentially
> free for that implementation if the buffered mutations are stored in a
> decent data structure.
>
> My recollection was that futures were seen as more cumbersome. They
> affect the types even for simple uses. The only appealing future API was
> Guava's, but we didn't want that on the API surface. And we did not intend
> for these to be used in complex ways, so the usability & perf benefits of 
> a
> future-based API weren't really realized anyhow.
>
> The only reason I belabor this is that if we ever wanted to support
> more complex use cases, such as concurrent use of state, then my 
> preference
> would flip. I wouldn't want to make XYZState a synchronized monitor. At
> that point I'd favor using a snapshots-are-free concurrent data structure
> under the hood of a future-based API.
>
> Since there is really only one implementation in mind for this, maybe
> only one that works reasonably, we should just document it as such. The
> docs on ReadableState do make it sound like writes will invalidate the
> usefulness of readLater, even though that isn't the case for the intended
> implementation strategy.
>
> Kenn
>
> On Thu, May 24, 2018 at 9:40 AM Ben Chambers 
> wrote:
>
>> I think Kenn's second option accurately reflects my memory of the
>> original intentions:
>>
>> 1. I remember we considered either using the Future interface or
>> calling the ReadableState interface a future, and explicitly 

Re: [DISCUSS] [BEAM-4126] Deleting Maven build files (pom.xml) grace period?

2018-06-06 Thread Eugene Kirpichov
Is it possible for Dataflow to just keep a copy of the pom.xmls and delete
it as soon as Dataflow is migrated?

Overall +1, I've been using Gradle without issues for a while and almost
forgot pom.xml's still existed.

On Wed, Jun 6, 2018, 1:13 PM Pablo Estrada  wrote:

> I agree that we should delete the pom.xml files soon, as they create a
> burden for maintainers.
>
> I'd like to be able to extend the grace period by a bit, to allow the
> internal build systems at Google to move away from using the Beam poms.
>
> We use these pom files to build Dataflow workers, and thus it's critical
> for us that they are available for a few more weeks while we set up a
> gradle build. Perhaps 4 weeks?
> (Calling out+Chamikara Jayalath  who has recently
> worked on internal Dataflow tooling.)
>
> Best
> -P.
>
> On Wed, Jun 6, 2018 at 1:05 PM Lukasz Cwik  wrote:
>
>> Note: Apache Beam will still provide pom.xml for each release it
>> produces. This is only about people using Maven to build Apache Beam
>> themselves and not relying on the released artifacts in Maven Central.
>>
>> With the first release using Gradle as the build system is underway, I
>> wanted to start this thread to remind people that we are going to delete
>> the Maven pom.xml files after the 2.5.0 release is finalized plus a two
>> week grace period.
>>
>> Are there others who would like a shorter/longer grace period?
>>
>> The PR to delete the pom.xml is here:
>> https://github.com/apache/beam/pull/5571
>>
> --
> Got feedback? go/pabloem-feedback
> 
>


Re: Proposal: keeping post-commit tests green

2018-06-06 Thread Mikhail Gryzykhin
Hello everyone,

Most of the comments on my last draft addressed technical details of
automating the specific processes proposed. No major process
changes were suggested.

If you have not yet, please review this document.

Highlights from last change:
* Bumped splitting test jobs after Kenneth's comment.
* No-commit in case of too many open JIRA tickets (metric was there, action
was missing)
* No-commit in case of too old JIRA ticket (metric was there, action was
missing)
* Closed comments that are addressed in the document.

This document already has two LGTMs from Scott Wegner and Thomas Weise.
If no major comments come, I'll treat this document as complete and
start working on implementing the work items defined in it.

Thank you,
--Mikhail


On Tue, Jun 5, 2018 at 7:38 PM Thomas Weise  wrote:

> Thanks for taking this initiative. As the number of contributors grows, so
> does the cost of broken builds. I'm also in favor of locking master merges
> until related issues are fixed (short term pain for long term gain). It
> would penalize a few for the benefit of many.
>
> On that note, recently we also had a fair share of pre-commit build
> issues, with a few making their way to master. These include instances
> unrelated to build tooling, such as compile errors or packaging. I don't
> think we should run PR merges over the red light, and I suggest it is
> necessary to step up the gatekeeper responsibility committers have.
>
> Thanks,
> Thomas
>
>
> On Tue, Jun 5, 2018 at 10:56 AM, Scott Wegner  wrote:
>
>> I've taken another pass over the doc, and it looks good to me. Thanks for
>> driving this effort!
>>
>> On Mon, Jun 4, 2018 at 9:08 AM Mikhail Gryzykhin 
>> wrote:
>>
>>> Hello everyone,
>>>
>>> I have addressed comments on the proposal doc and updated it
>>> accordingly. I have also added section on metrics that we want to track for
>>> pre-commit tests and contents for dashboard.
>>>
>>> Please, take a second look at the document.
>>>
>>> Highlights:
>>> * Sections that I feel require more discussion are marked with *[More
>>> opinions wanted]*
>>> ** I've kept original comments open for this iteration. Please close
>>> them if you feel they are resolved, or elaborate more on the topic.*
>>> * Added information on metrics to track
>>> * Moved “Split test jobs into automatically and manually triggered” to
>>> “Other ideas to consider”
>>> * Prioritized automated JIRA ticket creation over manual
>>> * Prioritized roll-back first policy
>>> * Added process for enforcing proposed policies.
>>>
>>> --Mikhail
>>>
>>> Have feedback ?
>>>
>>>
>>> On Tue, May 22, 2018 at 10:11 AM Scott Wegner 
>>> wrote:
>>>
 Thanks for the thoughtful proposal Mikhail. I've left some comments in
 the doc.

 I encourage others to take a look: the proposal adds some strong
 policies about dealing with post-commit failures (rollback policy, locking
 master). Currently our post-commits are frequently red, and we're missing
 out on a valuable quality signal. I'm in favor of such policies to help get
 the test signals back to a healthy state.

 On Mon, May 21, 2018 at 2:48 PM Mikhail Gryzykhin 
 wrote:

> Hi Everyone,
>
> I've updated design doc according to comments.
>
> https://docs.google.com/document/d/1sczGwnCvdHiboVajGVdnZL0rfnr7ViXXAebBAf_uQME
>
> In general, the ideas proposed seem to be appreciated. Still, some of the
> sections require more discussion.
>
> Changes highlight:
> * Added roll-back first policy to best practices. This includes
> process on how to handle roll-back.
> * Marked topics that I'd like to have more input on. [cyan color]
>
> --Mikhail
>
> Have feedback ?
>
>
> On Fri, May 18, 2018 at 10:56 AM Andrew Pilloud 
> wrote:
>
>> Blocking commits to master on test flaps seems critical here. The
>> test flaps won't get the attention they deserve as long as people are 
>> just
>> spamming their PRs with 'Run Java Precommit' until they turn green. I'm
>> guilty of this behavior and I know it masks new flaky tests.
>>
>> I added a comment to your doc about detecting flaky tests. This can
>> easily be done by rerunning the postcommits during times when Jenkins 
>> would
>> otherwise be idle. You'll easily get a few dozen runs every weekend, you
>> just need a process to triage all the flakes and ensure there are bugs. I
>> worked on a project that did this along with blocking master on any post
>> commit failure. It was painful for the first few weeks, but things got
>> significantly better once most of the bugs were fixed.
>>
>> Andrew
>>
>> On Fri, May 18, 2018 at 10:39 AM Kenneth Knowles 
>> wrote:
>>
>>> Love it. I would also pull out from the doc the key point: make the
>>> postcommit status constantly visible to everyone.

Re: [DISCUSS] [BEAM-4126] Deleting Maven build files (pom.xml) grace period?

2018-06-06 Thread Pablo Estrada
I agree that we should delete the pom.xml files soon, as they create a
burden for maintainers.

I'd like to be able to extend the grace period by a bit, to allow the
internal build systems at Google to move away from using the Beam poms.

We use these pom files to build Dataflow workers, and thus it's critical
for us that they are available for a few more weeks while we set up a
gradle build. Perhaps 4 weeks?
(Calling out+Chamikara Jayalath  who has recently
worked on internal Dataflow tooling.)

Best
-P.

On Wed, Jun 6, 2018 at 1:05 PM Lukasz Cwik  wrote:

> Note: Apache Beam will still provide pom.xml for each release it produces.
> This is only about people using Maven to build Apache Beam themselves and
> not relying on the released artifacts in Maven Central.
>
> With the first release using Gradle as the build system is underway, I
> wanted to start this thread to remind people that we are going to delete
> the Maven pom.xml files after the 2.5.0 release is finalized plus a two
> week grace period.
>
> Are there others who would like a shorter/longer grace period?
>
> The PR to delete the pom.xml is here:
> https://github.com/apache/beam/pull/5571
>
-- 
Got feedback? go/pabloem-feedback


[DISCUSS] [BEAM-4126] Deleting Maven build files (pom.xml) grace period?

2018-06-06 Thread Lukasz Cwik
Note: Apache Beam will still provide pom.xml for each release it produces.
This is only about people using Maven to build Apache Beam themselves and
not relying on the released artifacts in Maven Central.

With the first release using Gradle as the build system is underway, I
wanted to start this thread to remind people that we are going to delete
the Maven pom.xml files after the 2.5.0 release is finalized plus a two
week grace period.

Are there others who would like a shorter/longer grace period?

The PR to delete the pom.xml is here:
https://github.com/apache/beam/pull/5571


Re: [Call for items] Beam June Newsletter

2018-06-06 Thread Griselda Cuevas
Great question Scott!

Yes, this is a monthly Newsletter and the June Edition should cover the
things that happened in May.

On Wed, 6 Jun 2018 at 08:05, Scott Wegner  wrote:

> Thanks for putting this together, Gris. I'm not familiar with the format:
> what time period this newsletter should cover? Is this a monthly newsletter
> and the June edition covers news from May?
>
> On Tue, Jun 5, 2018 at 4:47 PM Griselda Cuevas  wrote:
>
>> Hi Everyone,
>>
>> Just a reminder to add items to the June Newsletter. The idea behind it
>> is to summarize community efforts in the project, to help others identify
>> similar actions, opportunities to collaborate, and questions to ask. Folks in
>> the mailing list have found these newsletters useful in the past, so let's
>> keep the tradition going :)
>>
>> I'll extend the deadline until 6/7 11:59 p.m. PST. If you have questions
>> let me know.
>>
>> Thanks!
>> G
>>
>> On Fri, 1 Jun 2018 at 12:19, Griselda Cuevas  wrote:
>>
>>> Hi Everyone,
>>>
>>> Here's
>>> 
>>> [1] the template for the June Beam Newsletter.
>>>
>>> *Add the updates you want to share with the community by 6/5 11:59 p.m.*
>>> *Pacific Time.*
>>>
>>> I'll edit and send the final version through the users mailing list on
>>> 6/7.
>>>
>>> Thank you!
>>> Gris
>>>
>>> [1]
>>> https://docs.google.com/document/d/1BwRhOu-uDd3SLB_Om_Beke5RoGKos4hj7Ljh7zM2YIo/edit
>>>
>>


Re: SDK Harness Deployment

2018-06-06 Thread Henning Rohde
Thanks for writing down and explaining the problem, Thomas. Let me try to
tease some of the topics apart.

First, the basic setup is currently as follows: there are 2 worker
processes (A) "SDK harness" and (B) "Runner harness" that needs to
communicate. A connects to B. The fundamental endpoint(s) of B as well as
an id -- logging, provisioning, artifacts and control -- are provided to A
via command line parameters. A is not expected to be able to connect to the
control port without first obtaining pipeline options (from provisioning)
and staged files (from artifacts). As an aside, this is where the separate
boot.go code comes in handy. A can assume it will be restarted if it
exits. A does not assume the given endpoints are up when started and should
make blocking calls with timeout (but if it doesn't and exits, it is restarted
anyway and will retry). Note that the data plane endpoints are part of the
control instructions and need not be known or allocated at startup or even
be served by the same TM.
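
As a sketch of that contract from A's side (flag handling and the endpoint
format are assumptions for illustration, not the actual boot.go logic), the
retry behavior might look like this in Java:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.concurrent.TimeUnit;

public class HarnessBoot {
  public static void main(String[] args) throws InterruptedException {
    // Endpoint handed to the container by B on the command line,
    // e.g. "localhost:8777" (format assumed for illustration).
    String[] hostPort = args[0].split(":");
    InetSocketAddress provision =
        new InetSocketAddress(hostPort[0], Integer.parseInt(hostPort[1]));

    // A does not assume the endpoint is up: block with a timeout, retry on
    // failure. If A exits instead, it is restarted and retries anyway.
    while (true) {
      try (Socket s = new Socket()) {
        s.connect(provision, (int) TimeUnit.SECONDS.toMillis(30));
        break; // connected: next fetch pipeline options and staged files,
               // then dial the control endpoint
      } catch (IOException notUpYet) {
        TimeUnit.SECONDS.sleep(5);
      }
    }
  }
}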

Second, whether or not docker is used is rather an implementation detail,
but if we use Kubernetes (or other such options) then some constraints come
into play.

Either way, two scenarios work well:
   (1) B starts A: The ULR and Flink prototype do this. B will delay
starting A until it has decided which endpoints to use. This approach
requires B to do process/container management, which we'd rather not have
to do at scale. But it's convenient for local runners.
   (2) B has its (local) endpoints configured or fixed: A and B can be
started concurrently. Dataflow does this. Kubernetes lends itself well to
this approach (and handles container management for us).

The Flink on Kubernetes scenario described above doesn't:
   (3) B must use randomized (local) endpoints _and_ A and B are started
concurrently: A would not know where to connect.

Perhaps I'm not understanding the constraints of the TM well enough, but
can we really not open a configured/fixed port from the TM -- especially in
a network-isolated Kubernetes pod? Adding a third process (C) "proxy" to
the pod might be an alternative option and morph (3) into (2). B would
configure C when it's ready. A would connect to C, but be blocked until B
has configured it. C could perhaps even serve logging, provisioning, and
artifacts without B. And the data plane would not go over C anyway. If
control proxy'ing is a concern, then alternatively we would add an
indirection to the container contract and provide the control endpoint in
the provisioning api, say, or even a new discovery service.
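
The proxy idea could be as small as a fixed-port forwarder that holds A's
connections until B has published its dynamic endpoint; a rough,
assumption-laden Java sketch (gRPC/HTTP2 traffic proxies fine at the raw
TCP level):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.CountDownLatch;

public class ControlProxy {
  private final CountDownLatch configured = new CountDownLatch(1);
  private volatile String targetHost;
  private volatile int targetPort;

  // Called by B once it has chosen its dynamic control endpoint.
  public void configure(String host, int port) {
    targetHost = host;
    targetPort = port;
    configured.countDown();
  }

  // A connects here on a fixed, well-known port.
  public void serve(int fixedPort) throws IOException {
    try (ServerSocket server = new ServerSocket(fixedPort)) {
      while (true) {
        Socket client = server.accept();
        new Thread(() -> forward(client)).start();
      }
    }
  }

  private void forward(Socket client) {
    try {
      configured.await(); // hold the connection until B has configured us
      try (Socket backend = new Socket(targetHost, targetPort)) {
        Thread t = new Thread(() -> pipe(client, backend));
        t.start();
        pipe(backend, client);
        t.join();
      }
    } catch (Exception e) {
      // drop the connection; A is expected to retry
    } finally {
      try { client.close(); } catch (IOException ignored) {}
    }
  }

  private static void pipe(Socket from, Socket to) {
    try (InputStream in = from.getInputStream();
         OutputStream out = to.getOutputStream()) {
      in.transferTo(out); // Java 9+; use a manual copy loop on Java 8
    } catch (IOException ignored) {
      // connection closed
    }
  }
}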

There are of course other options and tradeoffs, but having Flink work on
Kubernetes and not go against the grain seems desirable to me.

Thanks,
 Henning


On Wed, Jun 6, 2018 at 10:11 AM Thomas Weise  wrote:

> Hi,
>
> The current plan for running the SDK harness is to execute docker to
> launch SDK containers with service endpoints provided by the runner in the
> docker command line.
>
> In the case of Flink runner (prototype), the service endpoints are
> dynamically allocated per executable stage. There is typically one Flink
> task manager running per machine. Each TM has multiple task slots. A subset
> of these task slots will run the Beam executable stages. Flink allows
> multiple jobs in one TM, so we could have executable stages of different
> pipelines running in a single TM, depending on how users deploy. The
> prototype also has no cleanup for the SDK containers; they remain running
> and orphaned once the runner is gone.
>
> I'm trying to find out how this approach can be augmented for deployment
> on Kubernetes. Our deployments won't allow multiple jobs per task manager,
> so all task slots will belong to the same pipeline context. The intent is
> to deploy SDK harness containers along with TMs in the same pod. No
> assumption can be made about the order in which the containers are started,
> and the SDK container wouldn't know the connect address at startup (it can
> only be discovered after the pipeline gets deployed into the TMs).
>
> I talked about that a while ago with Henning and one idea was to set a
> fixed endpoint address so that the boot code in the SDK container knows
> upfront where to connect to, even when that endpoint isn't available yet.
> This approach may work with minimal changes to runner and little or no
> change to SDK container (as long as the SDK is prepared to retry). The
> downside is that all (parallel) task slots of the TM will use the same SDK
> worker, which will likely lead to performance issues, at least with the
> Python SDK that we are planning to use.
>
> An alternative may be to define an SDK worker pool per pod, with a
> discovery mechanism for workers to find the runner endpoints and a
> coordination mechanism that distributes the dynamically allocated endpoints
> that are provided by the executable stage task slots over the available
> workers.
>
> Any thoughts on this? Is anyone else looking at a docker-free deployment?
>
> Thanks,
> Thomas

Apache Beam Contribution Guide Improvements

2018-06-06 Thread Alan Myrvold
I've written up a brief document with ideas for Apache Beam contribution
guide improvements. I'm most interested in clarifying the goals of this
guide, including whether development on Windows is worth documenting, and
what areas are missing.

Feedback welcome, especially comments and suggestions in the document:

https://docs.google.com/document/d/1zukoPXPgUq3Vli_rOJ0ykzK6NbR6g-FrgSHZjLd23bo/edit?usp=sharing

Alan Myrvold


Re: [PROPOSAL] Preparing 2.5.0 release next week

2018-06-06 Thread Jean-Baptiste Onofré
The tag is created by the plugin, but it's not pushed to the remote.

I had to do:

git push apache v2.5.0.RC1

And yes, I created the branch "manually".

I also did a mvn versions:set on master to update the pom.xml, but not
on the branch (as I focused on gradle release).

Regards
JB
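
For reference, the manual steps described above, collected in one place
(the branch name and next development version are assumptions):

# Push the tag created locally by the release plugin.
git push apache v2.5.0.RC1

# Create the release branch by hand; the plugin does not create one.
git checkout -b release-2.5.0
git push apache release-2.5.0

# Bump the Maven version on master (the poms are not yet deleted).
mvn versions:set -DnewVersion=2.6.0-SNAPSHOT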

On 06/06/2018 19:29, Pablo Estrada wrote:
> This is because the release plugin that we went with[1] only produces
> release (and release candidate) tags, but it does not create a new
> branch, so the branch itself had to be created manually.
> 
> There were other plugins with the extra branching functionality[2], but
> we decided to use the more popular one.
> 
> [1] https://github.com/researchgate/gradle-release
> [2] https://github.com/nebula-plugins/nebula-release-plugin 
> 
> Best
> -P.
> 
> On Wed, Jun 6, 2018 at 10:21 AM Lukasz Cwik  > wrote:
> 
> I was under the impression that a "release" like plugin was added
> and it was meant to generate the git tag and do other release
> related tasks:
> 
> https://github.com/apache/beam/blob/72cbd99d6b62bc7ed16dbd1288cd61d54e8bda37/build.gradle#L181
> 
> Pablo / Ahmet, do you have more information as it doesn't seem to be
> working?
> 
> On Wed, Jun 6, 2018 at 9:49 AM Jean-Baptiste Onofré  > wrote:
> 
> Hi Scott,
> 
> it contains --no-parallel to test the release, but not to upload
> artifacts.
> 
> --no-daemon is also a must have for all steps.
> 
> There are some missing manual steps to add like pushing the git tag,
> generating the javadoc, ...
> 
> I will update the release guide PR based on what I did for the
> 2.5.0.RC1
> release.
> 
> Regards
> JB
> 
> On 06/06/2018 18:45, Scott Wegner wrote:
> > Tim and Boyuan were previously discussing similar issues in
> the Slack
> > channel [1], and the root cause was related to JAR corruption
> by the
> > signing plugin when using parallel builds. There was also some
> > investigation in BEAM-4328 [2].
> >
> > I believe fixes for all known issues are now merged. The Gradle-based
> Gradle-based
> > release guide is in review [3] but already includes
> instructions for
> > --no-parallel. If there are still additional issues we should
> create
> > JIRAs for them.
> >
> >
> [1] https://the-asf.slack.com/archives/C9H0YNP3P/p1526496272000381 
> > [2] https://issues.apache.org/jira/browse/BEAM-4328 
> > [3] https://github.com/apache/beam-site/pull/424 
> >
> > On Wed, Jun 6, 2018 at 9:13 AM Robert Bradshaw
> mailto:rober...@google.com>
> > >> wrote:
> >
> >     Are there JIRAs filed for these? I have yet to have a
> corrupt cache,
> >     but it would be nice to know how to avoid and fix it.
> >     Did --no-parallel make the ErrorProne error go away? 
> >
> >     On Tue, Jun 5, 2018 at 11:39 PM Romain Manni-Bucau
> >     mailto:rmannibu...@gmail.com>
> >>
> wrote:
> >
> >         Also maybe deactivate the daemon (--no-daemon) since
> its cache
> >         can get corrupted ~easily.
> >
> >         Romain Manni-Bucau
> >         @rmannibucau  |  Blog
> >          | Old Blog
> >          | Github
> >          | LinkedIn
> >          | Book
> >       
>  
> 
> >
> >
> >         Le mer. 6 juin 2018 à 08:35, Jean-Baptiste Onofré
> >         mailto:j...@nanthrax.net>
> >> a écrit :
> >
> >             It looks better with --no-parallel
> >
> >             Regards
> >             JB
> >
> >             On 06/06/2018 07:49, Jean-Baptiste Onofré wrote:
> >             > New issue during:
> >             >
> >             > ./gradlew publish -PisRelease
> >             >
> >             >> Task :beam-runners-apex:compileTestJava
> >             >
> >           
>  
> /home/jbonofre/Workspace/beam/runners/apex/src/test/java/org/apache/beam/runners/apex/translation/utils/ApexStateInternalsTest.java:55:
> >             > error: An unhandled exception was thrown by the
> Error
> >             Prone static
> >             > analysis plugin.

Re: [PROPOSAL] Preparing 2.5.0 release next week

2018-06-06 Thread Pablo Estrada
This is because the release plugin that we went with[1] only produces
release (and release candidate) tags, but it does not create a new branch,
so the branch itself had to be created manually.

There were other plugins with the extra branching functionality[2], but we
decided to use the more popular one.

[1] https://github.com/researchgate/gradle-release
[2] https://github.com/nebula-plugins/nebula-release-plugin

Best
-P.

On Wed, Jun 6, 2018 at 10:21 AM Lukasz Cwik  wrote:

> I was under the impression that a "release" like plugin was added and it
> was meant to generate the git tag and do other release related tasks:
>
> https://github.com/apache/beam/blob/72cbd99d6b62bc7ed16dbd1288cd61d54e8bda37/build.gradle#L181
>
> Pablo / Ahmet, do you have more information as it doesn't seem to be
> working?
>
> On Wed, Jun 6, 2018 at 9:49 AM Jean-Baptiste Onofré 
> wrote:
>
>> Hi Scott,
>>
>> it contains --no-parallel to test the release, but not to upload
>> artifacts.
>>
>> --no-daemon is also a must have for all steps.
>>
>> There are some missing manual steps to add like pushing the git tag,
>> generating the javadoc, ...
>>
>> I will update the release guide PR based on what I did for the 2.5.0.RC1
>> release.
>>
>> Regards
>> JB
>>
>> On 06/06/2018 18:45, Scott Wegner wrote:
>> > Tim and Boyuan were previously discussing similar issues in the Slack
>> > channel [1], and the root cause was related to JAR corruption by the
>> > signing plugin when using parallel builds. There was also some
>> > investigation in BEAM-4328 [2].
>> >
>> > I believe fixes for all known issues are now merged. The Gradle-based
>> > release guide is in review [3] but already includes instructions for
>> > --no-parallel. If there are still additional issues we should create
>> > JIRAs for them.
>> >
>> > [1] https://the-asf.slack.com/archives/C9H0YNP3P/p1526496272000381
>> > [2] https://issues.apache.org/jira/browse/BEAM-4328
>> > [3] https://github.com/apache/beam-site/pull/424
>> >
>> > On Wed, Jun 6, 2018 at 9:13 AM Robert Bradshaw > > > wrote:
>> >
>> > Are there JIRAs filed for these? I have yet to have a corrupt cache,
>> > but it would be nice to know how to avoid and fix it.
>> > Did --no-parallel make the ErrorProne error go away?
>> >
>> > On Tue, Jun 5, 2018 at 11:39 PM Romain Manni-Bucau
>> > mailto:rmannibu...@gmail.com>> wrote:
>> >
>> > Also maybe deactivate the daemon (--no-daemon) since its cache
>> > can get corrupted ~easily.
>> >
>> > Romain Manni-Bucau
>> > @rmannibucau  |  Blog
>> >  | Old Blog
>> >  | Github
>> >  | LinkedIn
>> >  | Book
>> > <
>> https://www.packtpub.com/application-development/java-ee-8-high-performance
>> >
>> >
>> >
>> > Le mer. 6 juin 2018 à 08:35, Jean-Baptiste Onofré
>> > mailto:j...@nanthrax.net>> a écrit :
>> >
>> > It looks better with --no-parallel
>> >
>> > Regards
>> > JB
>> >
>> > On 06/06/2018 07:49, Jean-Baptiste Onofré wrote:
>> > > New issue during:
>> > >
>> > > ./gradlew publish -PisRelease
>> > >
>> > >> Task :beam-runners-apex:compileTestJava
>> > >
>> >
>>  
>> /home/jbonofre/Workspace/beam/runners/apex/src/test/java/org/apache/beam/runners/apex/translation/utils/ApexStateInternalsTest.java:55:
>> > > error: An unhandled exception was thrown by the Error
>> > Prone static
>> > > analysis plugin.
>> > >   public static class StandardStateInternalsTests extends
>> > > StateInternalsTest {
>> > > ^
>> > >  Please report this at
>> > > https://github.com/google/error-prone/issues/new and
>> > include the following:
>> > >
>> > >  error-prone version: 2.3.1
>> > >  BugPattern: HidingField
>> > >  Stack Trace:
>> > >  com.sun.tools.javac.code.ClassFinder$BadClassFile:
>> > bad class file:
>> > >
>> >
>>  
>> /home/jbonofre/Workspace/beam/runners/core-java/build/libs/beam-runners-core-java-2.5.0-tests.jar(/org/apache/beam/runners/core/StateInternalsTest$MapEntry.class)
>> > > unable to access file: java.io.EOFException:
>> > Unexpected end of ZLIB
>> > > input stream
>> > > Please remove or make sure it appears in the correct
>> > subdirectory of
>> > > the classpath.
>> > > at
>> > >
>> >
>>  com.sun.tools.javac.jvm.ClassReader.badClassFile(ClassReader.java:298)
>> > > at
>> > >
>> >
>>  com.sun.tools.javac.jvm.ClassReader.readClassFile(ClassReader.java:2830)

Re: [PROPOSAL] Preparing 2.5.0 release next week

2018-06-06 Thread Lukasz Cwik
I was under the impression that a "release" like plugin was added and it
was meant to generate the git tag and do other release related tasks:
https://github.com/apache/beam/blob/72cbd99d6b62bc7ed16dbd1288cd61d54e8bda37/build.gradle#L181

Pablo / Ahmet, do you have more information as it doesn't seem to be
working?

On Wed, Jun 6, 2018 at 9:49 AM Jean-Baptiste Onofré  wrote:

> Hi Scott,
>
> it contains --no-parallel to test the release, but not to upload artifacts.
>
> --no-daemon is also a must have for all steps.
>
> There are some missing manual steps to add like pushing the git tag,
> generating the javadoc, ...
>
> I will update the release guide PR based on what I did for the 2.5.0.RC1
> release.
>
> Regards
> JB
>
> On 06/06/2018 18:45, Scott Wegner wrote:
> > Tim and Boyuan were previously discussing similar issues in the Slack
> > channel [1], and the root cause was related to JAR corruption by the
> > signing plugin when using parallel builds. There was also some
> > investigation in BEAM-4328 [2].
> >
> > I believe fixes for all known issues are now merged. The Gradle-based
> > release guide is in review [3] but already includes instructions for
> > --no-parallel. If there are still additional issues we should create
> > JIRAs for them.
> >
> > [1] https://the-asf.slack.com/archives/C9H0YNP3P/p1526496272000381
> > [2] https://issues.apache.org/jira/browse/BEAM-4328
> > [3] https://github.com/apache/beam-site/pull/424
> >
> > On Wed, Jun 6, 2018 at 9:13 AM Robert Bradshaw  > > wrote:
> >
> > Are there JIRAs filed for these? I have yet to have a corrupt cache,
> > but it would be nice to know how to avoid and fix it.
> > Did --no-parallel make the ErrorProne error go away?
> >
> > On Tue, Jun 5, 2018 at 11:39 PM Romain Manni-Bucau
> > mailto:rmannibu...@gmail.com>> wrote:
> >
> > Also maybe deactivate the daemon (--no-daemon) since its cache
> > can get corrupted ~easily.
> >
> > Romain Manni-Bucau
> > @rmannibucau  |  Blog
> >  | Old Blog
> >  | Github
> >  | LinkedIn
> >  | Book
> > <
> https://www.packtpub.com/application-development/java-ee-8-high-performance
> >
> >
> >
> > Le mer. 6 juin 2018 à 08:35, Jean-Baptiste Onofré
> > mailto:j...@nanthrax.net>> a écrit :
> >
> > It looks better with --no-parallel
> >
> > Regards
> > JB
> >
> > On 06/06/2018 07:49, Jean-Baptiste Onofré wrote:
> > > New issue during:
> > >
> > > ./gradlew publish -PisRelease
> > >
> > >> Task :beam-runners-apex:compileTestJava
> > >
> >
>  
> /home/jbonofre/Workspace/beam/runners/apex/src/test/java/org/apache/beam/runners/apex/translation/utils/ApexStateInternalsTest.java:55:
> > > error: An unhandled exception was thrown by the Error
> > Prone static
> > > analysis plugin.
> > >   public static class StandardStateInternalsTests extends
> > > StateInternalsTest {
> > > ^
> > >  Please report this at
> > > https://github.com/google/error-prone/issues/new and
> > include the following:
> > >
> > >  error-prone version: 2.3.1
> > >  BugPattern: HidingField
> > >  Stack Trace:
> > >  com.sun.tools.javac.code.ClassFinder$BadClassFile:
> > bad class file:
> > >
> >
>  
> /home/jbonofre/Workspace/beam/runners/core-java/build/libs/beam-runners-core-java-2.5.0-tests.jar(/org/apache/beam/runners/core/StateInternalsTest$MapEntry.class)
> > > unable to access file: java.io.EOFException:
> > Unexpected end of ZLIB
> > > input stream
> > > Please remove or make sure it appears in the correct
> > subdirectory of
> > > the classpath.
> > > at
> > >
> >
>  com.sun.tools.javac.jvm.ClassReader.badClassFile(ClassReader.java:298)
> > > at
> > >
> >
>  com.sun.tools.javac.jvm.ClassReader.readClassFile(ClassReader.java:2830)
> > >
> > > I'm trying with --no-parallel.
> > >
> > > Regards
> > > JB
> > >
> > > On 06/06/2018 05:18, Lukasz Cwik wrote:
> > >> JB, I believe many people are waiting on the release to
> > happen and the
> > >> release branch is yet to be cut. It has been almost a
> > week since you
> > >> said you would cut the release branch. It seems like you're
> > very busy, can you explain clearly what is slowing you down so people
> > can help or would you like to defer doing the release at this point in time?

Re: Announcement & Proposal: HDFS tests on large cluster.

2018-06-06 Thread Pablo Estrada
This is really cool!

+1 for having a cluster with more than one machine run the test.

-P.

On Wed, Jun 6, 2018 at 9:57 AM Chamikara Jayalath 
wrote:

> On Wed, Jun 6, 2018 at 5:19 AM Łukasz Gajowy 
> wrote:
>
>> Hi all,
>>
>> I'd like to announce that thanks to Kamil Szewczyk, since this PR
>>  we have 4 file-based HDFS
>> tests run on a "Large HDFS Cluster"! More specifically I mean:
>>
>> - beam_PerformanceTests_TextIOIT_HDFS
>> - beam_PerformanceTests_Compressed_TextIOIT_HDFS
>> - beam_PerformanceTests_AvroIOIT_HDFS
>> - beam_PerformanceTests_XmlIOIT_HDFS
>>
>> The "Large HDFS Cluster" (in contrast to the small one, that is also
>> available) consists of a master node and three data nodes all in separate
>> pods. Thanks to that we can mimic more real-life scenarios on HDFS (3
>> distributed nodes) and possibly run bigger tests so there's progress! :)
>>
>>
> This is great. Also, looks like results are available in test dashboard:
> https://apache-beam-testing.appspot.com/explore?dashboard=5755685136498688
> (BTW we should add information about dashboard to the testing doc:
> https://beam.apache.org/contribute/testing/)
>
> I'm currently working on proper documentation for this so that everyone
>> can use it in IOITs (stay tuned).
>>
>> Regarding the above, I'd like to propose scaling up the
>> Kubernetes cluster. AFAIK, currently, it consists of 1 node. If we scale it
>> up to e.g. 3 nodes, the HDFS Kubernetes pods will distribute themselves on
>> different machines rather than one, making it an even more "real-life"
>> scenario (possibly more efficient?). Moreover, other Performance Tests
>> (such as JDBC or mongo) could use more space for their infrastructure as
>> well. Scaling up the cluster could also turn out useful for some future
>> efforts, like BEAM-4508[1] (adapting and running some old IOITs on
>> Jenkins).
>>
>> WDYT? Are there any objections?
>>
> +1 for increasing the size of Kubernetes cluster.
>
>>
>> [1] https://issues.apache.org/jira/browse/BEAM-4508
>>
>> --
Got feedback? go/pabloem-feedback


Re: Read from a Google Sheet based BigQuery table - Python SDK

2018-06-06 Thread Chamikara Jayalath
On Tue, Jun 5, 2018 at 9:56 PM Leonardo Biagioli 
wrote:

> Hi Cham,
>
> thanks, but those pages are related to authentication inside Google
> Cloud Platform services; I need to authenticate the job on Sheets… Since
> the required scope is https://www.googleapis.com/auth/drive, is there
> a way to pass it in the deployment phase of a Dataflow job?
>

I haven't tried this unfortunately, so I'm not sure if it will work. Are you
able to run queries against your federated table using the BQ dashboard
(without using Dataflow)? Also make sure that the Compute Engine service
account used by the Dataflow job is properly authenticated (as mentioned in
the document I provided). I recommend contacting Google Cloud support for
questions regarding the BQ and Dataflow services.

- Cham


> Thank you,
>
> Leonardo
>
>
>
> *Da:* Chamikara Jayalath 
> *Inviato:* martedì 5 giugno 2018 19:26
> *A:* u...@beam.apache.org
> *Cc:* dev@beam.apache.org
> *Oggetto:* Re: Read from a Google Sheet based BigQuery table - Python SDK
>
>
>
> See following regarding authenticating Dataflow jobs.
>
> https://cloud.google.com/dataflow/security-and-permissions
>
>
>
> I'm not sure about information specific to sheets, seems like there's some
> info in following.
>
> https://cloud.google.com/bigquery/external-data-drive
>
>
>
> On Tue, Jun 5, 2018 at 10:16 AM Leonardo Biagioli 
> wrote:
>
> Hi Cham,
>
> Thank you for taking time to answer!
>
> Is there a way to properly authenticate a Beam job on the Dataflow runner? I
> should specify the required scope to read from Sheets, but where can I set
> that parameter?
>
> Regards,
>
> Leonardo
>
>
>
> Il 05 giu 2018 18:28, Chamikara Jayalath  ha
> scritto:
>
> I don't think BQ federated tables support export jobs so reading directly
> from such tables likely will not work. But reading using a query should
> work if your job is authenticated properly  (I haven't tested this).
>
>
>
> - Cham
>
>
>
> On Tue, Jun 5, 2018, 5:56 AM Leonardo Biagioli 
> wrote:
>
> Hi guys,
>
> just wanted to ask you if there is a chance to read from a Sheet based
> BigQuery table from a Beam pipeline running on Dataflow…
>
> I usually specify additional scopes to use through the authentication when
> running simple Python code to do the same, but I wasn’t able to find a
> reference to something similar for Beam.
>
> Could you please help?
>
> Thank you very much!
>
> Leonardo
>
>
>
>


SDK Harness Deployment

2018-06-06 Thread Thomas Weise
Hi,

The current plan for running the SDK harness is to execute docker to launch
SDK containers with service endpoints provided by the runner in the docker
command line.
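
As a concrete picture of that plan, the runner's invocation might look
roughly like this (the image name and flag names are assumptions about the
container contract, for illustration only):

docker run beam-python-sdk-harness \
  --id=1 \
  --logging_endpoint=runner-host:8765 \
  --artifact_endpoint=runner-host:8766 \
  --provision_endpoint=runner-host:8767 \
  --control_endpoint=runner-host:8768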

In the case of Flink runner (prototype), the service endpoints are
dynamically allocated per executable stage. There is typically one Flink
task manager running per machine. Each TM has multiple task slots. A subset
of these task slots will run the Beam executable stages. Flink allows
multiple jobs in one TM, so we could have executable stages of different
pipelines running in a single TM, depending on how users deploy. The
prototype also has no cleanup for the SDK containers; they remain running
and orphaned once the runner is gone.

I'm trying to find out how this approach can be augmented for deployment on
Kubernetes. Our deployments won't allow multiple jobs per task manager, so
all task slots will belong to the same pipeline context. The intent is to
deploy SDK harness containers along with TMs in the same pod. No assumption
can be made about the order in which the containers are started, and the
SDK container wouldn't know the connect address at startup (it can only be
discovered after the pipeline gets deployed into the TMs).

I talked about that a while ago with Henning and one idea was to set a
fixed endpoint address so that the boot code in the SDK container knows
upfront where to connect to, even when that endpoint isn't available yet.
This approach may work with minimal changes to runner and little or no
change to SDK container (as long as the SDK is prepared to retry). The
downside is that all (parallel) task slots of the TM will use the same SDK
worker, which will likely lead to performance issues, at least with the
Python SDK that we are planning to use.

An alternative may be to define an SDK worker pool per pod, with a
discovery mechanism for workers to find the runner endpoints and a
coordination mechanism that distributes the dynamically allocated endpoints
that are provided by the executable stage task slots over the available
workers.

Any thoughts on this? Is anyone else looking at a docker-free deployment?

Thanks,
Thomas


Re: Announcement & Proposal: HDFS tests on large cluster.

2018-06-06 Thread Chamikara Jayalath
On Wed, Jun 6, 2018 at 5:19 AM Łukasz Gajowy 
wrote:

> Hi all,
>
> I'd like to announce that thanks to Kamil Szewczyk, since this PR
>  we have 4 file-based HDFS
> tests run on a "Large HDFS Cluster"! More specifically I mean:
>
> - beam_PerformanceTests_TextIOIT_HDFS
> - beam_PerformanceTests_Compressed_TextIOIT_HDFS
> - beam_PerformanceTests_AvroIOIT_HDFS
> - beam_PerformanceTests_XmlIOIT_HDFS
>
> The "Large HDFS Cluster" (in contrast to the small one, that is also
> available) consists of a master node and three data nodes all in separate
> pods. Thanks to that we can mimic more real-life scenarios on HDFS (3
> distributed nodes) and possibly run bigger tests so there's progress! :)
>
>
This is great. Also, looks like results are available in test dashboard:
https://apache-beam-testing.appspot.com/explore?dashboard=5755685136498688
(BTW we should add information about dashboard to the testing doc:
https://beam.apache.org/contribute/testing/)

I'm currently working on proper documentation for this so that everyone can
> use it in IOITs (stay tuned).
>
> Regarding the above, I'd like to propose scaling up the
> Kubernetes cluster. AFAIK, currently, it consists of 1 node. If we scale it
> up to eg. 3 nodes, the HDFS' kubernetes pods will distribute themselves on
> different machines rather than one, making it an even more "real-life"
> scenario (possibly more efficient?). Moreover, other Performance Tests
> (such as JDBC or mongo) could use more space for their infrastructure as
> well. Scaling up the cluster could also turn out useful for some future
> efforts, like BEAM-4508[1] (adapting and running some old IOITs on
> Jenkins).
>
> WDYT? Are there any objections?
>
+1 for increasing the size of Kubernetes cluster.
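
If the cluster runs on GKE, the resize itself should be a one-liner along
these lines (cluster name and zone are placeholders; older gcloud versions
use --size instead of --num-nodes):

gcloud container clusters resize beam-test-cluster \
  --num-nodes=3 \
  --zone=us-central1-a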

>
> [1] https://issues.apache.org/jira/browse/BEAM-4508
>
>


Re: [PROPOSAL] Preparing 2.5.0 release next week

2018-06-06 Thread Jean-Baptiste Onofré
Hi Scott,

it contains --no-parallel to test the release, but not to upload artifacts.

--no-daemon is also a must have for all steps.

There are some missing manual steps to add like pushing the git tag,
generating the javadoc, ...

I will update the release guide PR based on what I did for the 2.5.0.RC1
release.

Regards
JB

On 06/06/2018 18:45, Scott Wegner wrote:
> Tim and Boyuan were previously discussing similar issues in the Slack
> channel [1], and the root cause was related to JAR corruption by the
> signing plugin when using parallel builds. There was also some
> investigation in BEAM-4328 [2].
> 
> I believe fixes for all known issues are now merged. The Gradle-based
> release guide is in review [3] but already includes instructions for
> --no-parallel. If there are still additional issues we should create
> JIRAs for them.
> 
> [1] https://the-asf.slack.com/archives/C9H0YNP3P/p1526496272000381 
> [2] https://issues.apache.org/jira/browse/BEAM-4328 
> [3] https://github.com/apache/beam-site/pull/424 
> 
> On Wed, Jun 6, 2018 at 9:13 AM Robert Bradshaw  > wrote:
> 
> Are there JIRAs filed for these? I have yet to have a corrupt cache,
> but it would be nice to know how to avoid and fix it.
> Did --no-parallel make the ErrorProne error go away? 
> 
> On Tue, Jun 5, 2018 at 11:39 PM Romain Manni-Bucau
> mailto:rmannibu...@gmail.com>> wrote:
> 
> Also maybe deactivate the daemon (--no-daemon) since its cache
> can get corrupted ~easily.
> 
> Romain Manni-Bucau
> @rmannibucau  |  Blog
>  | Old Blog
>  | Github
>  | LinkedIn
>  | Book
> 
> 
> 
> 
> Le mer. 6 juin 2018 à 08:35, Jean-Baptiste Onofré
> mailto:j...@nanthrax.net>> a écrit :
> 
> It looks better with --no-parallel
> 
> Regards
> JB
> 
> On 06/06/2018 07:49, Jean-Baptiste Onofré wrote:
> > New issue during:
> >
> > ./gradlew publish -PisRelease
> >
> >> Task :beam-runners-apex:compileTestJava
> >
> 
> /home/jbonofre/Workspace/beam/runners/apex/src/test/java/org/apache/beam/runners/apex/translation/utils/ApexStateInternalsTest.java:55:
> > error: An unhandled exception was thrown by the Error
> Prone static
> > analysis plugin.
> >   public static class StandardStateInternalsTests extends
> > StateInternalsTest {
> >                 ^
> >      Please report this at
> > https://github.com/google/error-prone/issues/new and
> include the following:
> >
> >      error-prone version: 2.3.1
> >      BugPattern: HidingField
> >      Stack Trace:
> >      com.sun.tools.javac.code.ClassFinder$BadClassFile:
> bad class file:
> >
> 
> /home/jbonofre/Workspace/beam/runners/core-java/build/libs/beam-runners-core-java-2.5.0-tests.jar(/org/apache/beam/runners/core/StateInternalsTest$MapEntry.class)
> >     unable to access file: java.io.EOFException:
> Unexpected end of ZLIB
> > input stream
> >     Please remove or make sure it appears in the correct
> subdirectory of
> > the classpath.
> >         at
> >
> 
> com.sun.tools.javac.jvm.ClassReader.badClassFile(ClassReader.java:298)
> >         at
> >
> 
> com.sun.tools.javac.jvm.ClassReader.readClassFile(ClassReader.java:2830)
> >
> > I'm trying with --no-parallel.
> >
> > Regards
> > JB
> >
> > On 06/06/2018 05:18, Lukasz Cwik wrote:
> >> JB, I believe many people are waiting on the release to
> happen and the
> >> release branch is yet to be cut. It has been almost a
> week since you
> >> said you would cut the release branch. It seems like your
> very busy, can
> >> you explain clearly what is slowing you down so people
> can help or would
> >> you like to defer doing the release at this point in time?
> >>
> >> On Tue, Jun 5, 2018 at 10:16 AM Jean-Baptiste Onofré
> mailto:j...@nanthrax.net>
> >> >> wrote:
> >>
> >>     Thanks, yeah I saw that but I'm planning to add some
> additional notes.
> >>
> >>     Regards
> >>     JB

Re: [PROPOSAL] Preparing 2.5.0 release next week

2018-06-06 Thread Scott Wegner
Tim and Boyuan were previously discussing similar issues in the Slack
channel [1], and the root cause was related to JAR corruption by the
signing plugin when using parallel builds. There was also some
investigation in BEAM-4328 [2].

I believe fixes for all known issues are now merged. The Gradle-based
release guide is in review [3] but already includes instructions for
--no-parallel. If there are still additional issues we should create JIRAs
for them.

[1] https://the-asf.slack.com/archives/C9H0YNP3P/p1526496272000381
[2] https://issues.apache.org/jira/browse/BEAM-4328
[3] https://github.com/apache/beam-site/pull/424

On Wed, Jun 6, 2018 at 9:13 AM Robert Bradshaw  wrote:

> Are there JIRAs filed for these? I have yet to have a corrupt cache, but
> it would be nice to know how to avoid and fix it. Did --no-parallel make
> the ErrorProne error go away?
>
> On Tue, Jun 5, 2018 at 11:39 PM Romain Manni-Bucau 
> wrote:
>
>> Also maybe deactivate the daemon (--no-daemon) since its cache can get
>> corrupted ~easily.
>>
>> Romain Manni-Bucau
>> @rmannibucau  |  Blog
>>  | Old Blog
>>  | Github
>>  | LinkedIn
>>  | Book
>> 
>>
>>
>> Le mer. 6 juin 2018 à 08:35, Jean-Baptiste Onofré  a
>> écrit :
>>
>>> It looks better with --no-parallel
>>>
>>> Regards
>>> JB
>>>
>>> On 06/06/2018 07:49, Jean-Baptiste Onofré wrote:
>>> > New issue during:
>>> >
>>> > ./gradlew publish -PisRelease
>>> >
>>> >> Task :beam-runners-apex:compileTestJava
>>> >
>>> /home/jbonofre/Workspace/beam/runners/apex/src/test/java/org/apache/beam/runners/apex/translation/utils/ApexStateInternalsTest.java:55:
>>> > error: An unhandled exception was thrown by the Error Prone static
>>> > analysis plugin.
>>> >   public static class StandardStateInternalsTests extends
>>> > StateInternalsTest {
>>> > ^
>>> >  Please report this at
>>> > https://github.com/google/error-prone/issues/new and include the
>>> following:
>>> >
>>> >  error-prone version: 2.3.1
>>> >  BugPattern: HidingField
>>> >  Stack Trace:
>>> >  com.sun.tools.javac.code.ClassFinder$BadClassFile: bad class file:
>>> >
>>> /home/jbonofre/Workspace/beam/runners/core-java/build/libs/beam-runners-core-java-2.5.0-tests.jar(/org/apache/beam/runners/core/StateInternalsTest$MapEntry.class)
>>> > unable to access file: java.io.EOFException: Unexpected end of ZLIB
>>> > input stream
>>> > Please remove or make sure it appears in the correct subdirectory
>>> of
>>> > the classpath.
>>> > at
>>> > com.sun.tools.javac.jvm.ClassReader.badClassFile(ClassReader.java:298)
>>> > at
>>> >
>>> com.sun.tools.javac.jvm.ClassReader.readClassFile(ClassReader.java:2830)
>>> >
>>> > I'm trying with --no-parallel.
>>> >
>>> > Regards
>>> > JB
>>> >
>>> > On 06/06/2018 05:18, Lukasz Cwik wrote:
>>> >> JB, I believe many people are waiting on the release to happen and the
>>> >> release branch is yet to be cut. It has been almost a week since you
>>> >> said you would cut the release branch. It seems like you're very busy,
>>> can
>>> >> you explain clearly what is slowing you down so people can help or
>>> would
>>> >> you like to defer doing the release at this point in time?
>>> >>
>>> >> On Tue, Jun 5, 2018 at 10:16 AM Jean-Baptiste Onofré >> >> > wrote:
>>> >>
>>> >> Thanks, yeah I saw that but I'm planning to add some additional
>>> notes.
>>> >>
>>> >> Regards
>>> >> JB
>>> >> Le 5 juin 2018, à 19:02, Boyuan Zhang >> >> > a écrit:
>>> >>
>>> >> Hey JB,
>>> >>
>>> >> We have some updates in :
>>> >>
>>> https://github.com/pabloem/beam-site/blob/372c93ecbafbf3a1440492df1e12050dfe939e91/src/contribute/release-guide.md
>>> >> <
>>> https://github.com/pabloem/beam-site/blob/372c93ecbafbf3a1440492df1e12050dfe939e91/src/contribute/release-guide.md
>>> >and
>>> >> https://github.com/apache/beam-site/pull/424/files
>>> >> <
>>> https://www.google.com/url?q=https://github.com/apache/beam-site/pull/424/files&sa=D&usg=AFQjCNHm8O1DwRKZ1828EQxzmx881O5aWA
>>> >.
>>> >> They may be helpful.
>>> >>
>>> >> Boyuan
>>> >>
>>> >> On Tue, Jun 5, 2018 at 9:50 AM Jean-Baptiste Onofré <
>>> >> j...@nanthrax.net > wrote:
>>> >>
>>> >> Was checking if there was something like the release
>>> branch
>>> >> of the maven release plugin, but there's not with the
>>> gradle
>>> >> one.
>>> >>
>>> >> I'm creating the branch by hand and I'm updating the
>>> release
>>> >> guide in the mean time.
>>> >>
>>> >> Regards
>>> >> JB
>>> >> Le 5 juin 2018, à 18:44, "Jean-Baptis

Re: [VOTE] Apache Beam, version 2.5.0, release candidate #1

2018-06-06 Thread Jean-Baptiste Onofré
I updated dist.apache.org dev with the Python distribution.

Regards
JB

On 06/06/2018 18:19, Jean-Baptiste Onofré wrote:
> Hi Robert,
> 
> sorry, I missed this step, let me add on dist.apache.org.
> 
> Thanks for the catch and sorry about that !
> 
> Regards
> JB
> 
> On 06/06/2018 18:06, Robert Bradshaw wrote:
>> Thank you JB! Glad to see this finally rolling out. I don't see the
>> Python artifacts; did you mean to stage them
>> in https://dist.apache.org/repos/dist/dev/beam/2.5.0/? If you want help
>> building wheels, let me know. 
>>
>>
>>
>> On Wed, Jun 6, 2018 at 1:50 AM Etienne Chauchot > > wrote:
>>
>> Thanks JB for all your work ! I believe doing the first gradle
>> release must have been hard.
>> I'll run Nexmark on the release and keep you posted.
>>
>> Best 
>> Etienne
>>
>>
>> Le mercredi 06 juin 2018 à 10:44 +0200, Jean-Baptiste Onofré a écrit :
>>> Hi everyone,
>>>
>>> Please review and vote on the release candidate #1 for the version
>>> 2.5.0, as follows:
>>>
>>> [ ] +1, Approve the release
>>> [ ] -1, Do not approve the release (please provide specific comments)
>>>
>>> NB: this is the first release using Gradle, so don't be too harsh ;) A
>>> PR about the release guide will follow thanks to this release.
>>>
>>> The complete staging area is available for your review, which includes:
>>> * JIRA release notes [1],
>>> * the official Apache source release to be deployed to dist.apache.org 
>>> 
>>> [2], which is signed with the key with fingerprint C8282E76 [3],
>>> * all artifacts to be deployed to the Maven Central Repository [4],
>>> * source code tag "v2.5.0-RC1" [5],
>>> * website pull request listing the release and publishing the API
>>> reference manual [6].
>>> * Java artifacts were built with Gradle 4.7 (wrapper) and OpenJDK/Oracle
>>> JDK 1.8.0_172 (Oracle Corporation 25.172-b11).
>>> * Python artifacts are deployed along with the source release to the
>>> dist.apache.org [2].
>>>
>>> The vote will be open for at least 72 hours. It is adopted by majority
>>> approval, with at least 3 PMC affirmative votes.
>>>
>>> Thanks,
>>> JB
>>>
>>> [1]
>>> 
>>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12342847
>>> [2] https://dist.apache.org/repos/dist/dev/beam/2.5.0/
>>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>> [4] 
>>> https://repository.apache.org/content/repositories/orgapachebeam-1041/
>>> [5] https://github.com/apache/beam/tree/v2.5.0-RC1
>>> [6] https://github.com/apache/beam-site/pull/463
>>>
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
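
For anyone following along: dist.apache.org is backed by SVN, so staging the
Python distribution amounts to an SVN commit. A minimal sketch, with the
artifact file name assumed rather than taken from the staging area:

  # check out the dev staging area for the 2.5.0 release candidate
  svn checkout https://dist.apache.org/repos/dist/dev/beam/2.5.0/ beam-2.5.0-staging
  cd beam-2.5.0-staging
  # add the Python source distribution (file name is an assumption)
  cp /path/to/apache-beam-2.5.0-python.zip .
  svn add apache-beam-2.5.0-python.zip
  svn commit -m "Stage Python distribution for Beam 2.5.0 RC1"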


Beam SQL Pipeline Options

2018-06-06 Thread Andrew Pilloud
We are just about at the point of having a working pure SQL workflow for
Beam! One of the last things that remains is how to configure Pipeline
Options via a SQL shell. I have written up a proposal to use the SET
statement, for example "SET runner=DataflowRunner". I'm looking for
feedback, particularly on what will make for the best user experience.
Please take a look and comment:

https://docs.google.com/document/d/1UTsSBuruJRfGnVOS9eXbQI6NauCD4WnSAPgA_Y0zjdk/edit?usp=sharing

Andrew
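
For illustration, a sketch of what such a shell session could look like if
the proposal is adopted. Only "SET runner=DataflowRunner" comes from the
proposal above; the other option names are assumptions made for the example:

  -- hypothetical SQL shell session using the proposed SET statement
  SET runner=DataflowRunner;            -- from the proposal
  SET project=my-gcp-project;           -- assumed option name
  SET tempLocation=gs://my-bucket/tmp;  -- assumed option name
  SELECT user_id, COUNT(*) FROM my_table GROUP BY user_id;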


Re: [VOTE] Apache Beam, version 2.5.0, release candidate #1

2018-06-06 Thread Jean-Baptiste Onofré
Hi Robert,

sorry, I missed this step, let me add it on dist.apache.org.

Thanks for the catch and sorry about that!

Regards
JB

On 06/06/2018 18:06, Robert Bradshaw wrote:
> Thank you JB! Glad to see this finally rolling out. I don't see the
> Python artifacts, did you mean to stage them
> in https://dist.apache.org/repos/dist/dev/beam/2.5.0/? If you want help
> building wheels, let me know. 
> 
> 
> 
> On Wed, Jun 6, 2018 at 1:50 AM Etienne Chauchot wrote:
> 
> Thanks JB for all your work! I believe doing the first Gradle
> release must have been hard.
> I'll run Nexmark on the release and keep you posted.
> 
> Best 
> Etienne
> 
> 
> On Wednesday, June 6, 2018 at 10:44 +0200, Jean-Baptiste Onofré wrote:
>> Hi everyone,
>>
>> Please review and vote on the release candidate #1 for the version
>> 2.5.0, as follows:
>>
>> [ ] +1, Approve the release
>> [ ] -1, Do not approve the release (please provide specific comments)
>>
>> NB: this is the first release using Gradle, so don't be too harsh ;) A
>> PR about the release guide will follow thanks to this release.
>>
>> The complete staging area is available for your review, which includes:
>> * JIRA release notes [1],
>> * the official Apache source release to be deployed to dist.apache.org
>> [2], which is signed with the key with fingerprint C8282E76 [3],
>> * all artifacts to be deployed to the Maven Central Repository [4],
>> * source code tag "v2.5.0-RC1" [5],
>> * website pull request listing the release and publishing the API
>> reference manual [6].
>> * Java artifacts were built with Gradle 4.7 (wrapper) and OpenJDK/Oracle
>> JDK 1.8.0_172 (Oracle Corporation 25.172-b11).
>> * Python artifacts are deployed along with the source release to the
>> dist.apache.org [2].
>>
>> The vote will be open for at least 72 hours. It is adopted by majority
>> approval, with at least 3 PMC affirmative votes.
>>
>> Thanks,
>> JB
>>
>> [1]
>> 
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12342847
>> [2] https://dist.apache.org/repos/dist/dev/beam/2.5.0/
>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>> [4] 
>> https://repository.apache.org/content/repositories/orgapachebeam-1041/
>> [5] https://github.com/apache/beam/tree/v2.5.0-RC1
>> [6] https://github.com/apache/beam-site/pull/463
>>

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [PROPOSAL] Preparing 2.5.0 release next week

2018-06-06 Thread Robert Bradshaw
Are there JIRAs filed for these? I have yet to hit a corrupt cache, but it
would be nice to know how to avoid and fix one. Did --no-parallel make the
ErrorProne error go away?

On Tue, Jun 5, 2018 at 11:39 PM Romain Manni-Bucau wrote:

> Also maybe deactivate the daemon (--no-daemon) since its cache can get
> corrupted ~easily.
>
> Romain Manni-Bucau
> @rmannibucau | Blog | Old Blog | Github | LinkedIn | Book
> 
>
>
> On Wed, Jun 6, 2018 at 08:35, Jean-Baptiste Onofré wrote:
>
>> It looks better with --no-parallel
>>
>> Regards
>> JB
>>
>> On 06/06/2018 07:49, Jean-Baptiste Onofré wrote:
>> > New issue during:
>> >
>> > ./gradlew publish -PisRelease
>> >
>> >> Task :beam-runners-apex:compileTestJava
>> >
>> /home/jbonofre/Workspace/beam/runners/apex/src/test/java/org/apache/beam/runners/apex/translation/utils/ApexStateInternalsTest.java:55:
>> > error: An unhandled exception was thrown by the Error Prone static
>> > analysis plugin.
>> >   public static class StandardStateInternalsTests extends
>> > StateInternalsTest {
>> > ^
>> >  Please report this at
>> > https://github.com/google/error-prone/issues/new and include the
>> following:
>> >
>> >  error-prone version: 2.3.1
>> >  BugPattern: HidingField
>> >  Stack Trace:
>> >  com.sun.tools.javac.code.ClassFinder$BadClassFile: bad class file:
>> >
>> /home/jbonofre/Workspace/beam/runners/core-java/build/libs/beam-runners-core-java-2.5.0-tests.jar(/org/apache/beam/runners/core/StateInternalsTest$MapEntry.class)
>> > unable to access file: java.io.EOFException: Unexpected end of ZLIB
>> > input stream
>> > Please remove or make sure it appears in the correct subdirectory of
>> > the classpath.
>> > at
>> > com.sun.tools.javac.jvm.ClassReader.badClassFile(ClassReader.java:298)
>> > at
>> > com.sun.tools.javac.jvm.ClassReader.readClassFile(ClassReader.java:2830)
>> >
>> > I'm trying with --no-parallel.
>> >
>> > Regards
>> > JB
>> >
>> > On 06/06/2018 05:18, Lukasz Cwik wrote:
>> >> JB, I believe many people are waiting on the release to happen and the
>> >> release branch is yet to be cut. It has been almost a week since you
>> >> said you would cut the release branch. It seems like you're very busy;
>> >> can you explain clearly what is slowing you down so people can help, or
>> >> would you like to defer doing the release at this point in time?
>> >>
>> >> On Tue, Jun 5, 2018 at 10:16 AM Jean-Baptiste Onofré wrote:
>> >>
>> >> Thanks, yeah I saw that, but I'm planning to add some additional
>> >> notes.
>> >>
>> >> Regards
>> >> JB
>> >> On 5 June 2018, at 19:02, Boyuan Zhang wrote:
>> >>
>> >> Hey JB,
>> >>
>> >> We have some updates in:
>> >> https://github.com/pabloem/beam-site/blob/372c93ecbafbf3a1440492df1e12050dfe939e91/src/contribute/release-guide.md
>> >> and https://github.com/apache/beam-site/pull/424/files
>> >> They may be helpful.
>> >>
>> >> Boyuan
>> >>
>> >> On Tue, Jun 5, 2018 at 9:50 AM Jean-Baptiste Onofré
>> >> <j...@nanthrax.net> wrote:
>> >>
>> >> I was checking if there was something like the release branch
>> >> handling of the Maven release plugin, but there's no equivalent with
>> >> the Gradle one.
>> >>
>> >> I'm creating the branch by hand and I'm updating the release
>> >> guide in the meantime.
>> >>
>> >> Regards
>> >> JB
>> >> On 5 June 2018, at 18:44, "Jean-Baptiste Onofré"
>> >> <j...@nanthrax.net> wrote:
>> >>
>> >> On the way
>> >> On 5 June 2018, at 18:23, Kenneth Knowles
>> >> <k...@google.com> wrote:
>> >>
>> >> Have you cut the release branch? It is much easier
>> >> to stabilize a cut branch that is separated from
>> >> continued development on master. I think we have to
>> >> cut it before continuing.
>> >>
>> >> Kenn
>> >>
>> >> On Tue, Jun 5, 2018 at 1:14 AM Jean-Baptiste Onofré
>> >> <j...@nanthrax.net> wrote:
>> >>
>> >> Sorry for the noise: this bu
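
For reference, combining the workarounds discussed in this thread, the
publish step would look roughly like this. Both flags are standard Gradle
options; that they avoid this particular failure is only what was observed
above, so treat this as a sketch rather than a verified fix:

  # release publish with parallel execution and the Gradle daemon disabled
  ./gradlew publish -PisRelease --no-parallel --no-daemon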

Re: [VOTE] Apache Beam, version 2.5.0, release candidate #1

2018-06-06 Thread Robert Bradshaw
Thank you JB! Glad to see this finally rolling out. I don't see the Python
artifacts, did you mean to stage them in
https://dist.apache.org/repos/dist/dev/beam/2.5.0/? If you want help
building wheels, let me know.



On Wed, Jun 6, 2018 at 1:50 AM Etienne Chauchot wrote:

> Thanks JB for all your work! I believe doing the first Gradle release
> must have been hard.
> I'll run Nexmark on the release and keep you posted.
>
> Best
> Etienne
>
>
> On Wednesday, June 6, 2018 at 10:44 +0200, Jean-Baptiste Onofré wrote:
>
> Hi everyone,
>
> Please review and vote on the release candidate #1 for the version
> 2.5.0, as follows:
>
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
> NB: this is the first release using Gradle, so don't be too harsh ;) A
> PR about the release guide will follow thanks to this release.
>
> The complete staging area is available for your review, which includes:
> * JIRA release notes [1],
> * the official Apache source release to be deployed to dist.apache.org
> [2], which is signed with the key with fingerprint C8282E76 [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "v2.5.0-RC1" [5],
> * website pull request listing the release and publishing the API
> reference manual [6].
> * Java artifacts were built with Gradle 4.7 (wrapper) and OpenJDK/Oracle
> JDK 1.8.0_172 (Oracle Corporation 25.172-b11).
> * Python artifacts are deployed along with the source release to the
> dist.apache.org [2].
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
> Thanks,
> JB
>
> [1]
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12342847
> [2] https://dist.apache.org/repos/dist/dev/beam/2.5.0/
> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> [4] https://repository.apache.org/content/repositories/orgapachebeam-1041/
> [5] https://github.com/apache/beam/tree/v2.5.0-RC1
> [6] https://github.com/apache/beam-site/pull/463


Re: The full list of proposals / prototype documents

2018-06-06 Thread Alexey Romanenko
FYI: Finally, it was merged and you can find this page here:
https://beam.apache.org/contribute/design-documents/ 


Thank you everybody who helped me to compile this list!
I'll do my best to keep it updated as new docs come in. At the same time,
please feel free to add your new docs (or notify me if I missed any) once
they are finished and ready to be published.

WBR,
Alexey

> On 31 May 2018, at 18:52, Eugene Kirpichov wrote:
> 
> Thank you!
> 
> On Thu, May 31, 2018 at 8:30 AM Alexey Romanenko wrote:
> Thank you everybody for the provided links. I collected all of them (please
> correct me if I missed something), categorized them, and created a dedicated
> page for the Beam website.
> 
> Here is a PR for that (please review):
> https://github.com/apache/beam-site/pull/456 
> 
> 
> WBR,
> Alexey
> 
>> On 30 May 2018, at 13:17, Łukasz Gajowy wrote:
>> 
>> Hi, 
>> 
>> I just wanted to add those two (sorry for being kinda late with this): 
>> 
>> https://docs.google.com/document/d/1dA-5s6OHiP_cz-NRAbwapoKF5MEC1wKps4A5tFbIPKE/edit?usp=sharing
>> https://docs.google.com/document/d/1Cb7XVmqe__nA_WCrriAifL-3WCzbZzV4Am5W_SkQLeA/edit?usp=sharing
>> 
>> Thanks, 
>> Łukasz 
>> 
>> 2018-05-29 22:42 GMT+02:00 Lukasz Cwik:
>> Providing ownership to the PMC account allows others to take over ownership 
>> of the document once a contributor stops being active. This allows docs to 
>> be updated (even if just to point to a newer doc).
>> 
>> On Tue, May 29, 2018 at 1:20 PM Kenneth Knowles wrote:
>> My position on ownership is that design docs are really documents "of the
>> moment", authored by a particular individual or group. Experience shows that
>> even if you try, keeping them fresh is not likely to happen. Anything that
>> needs freshness (like end-user docs) should be in a different medium. I would
>> just date the gdoc so readers know how to interpret it (the automated "last
>> edit" date is not sufficient for understanding how stale something is).
>> 
>> So it seems like it makes little difference if the project or PMC has 
>> ownership or even write access. Of course I have no objections if someone 
>> wants to transfer ownership, but is there a reason to encourage it?
>> 
>> Kenn
>> 
>> On Tue, May 29, 2018 at 1:11 PM Lukasz Cwik wrote:
>> I transferred ownership of the docs that I owned to the apacheb...@gmail.com
>> PMC account and put them into the drive folder.
>> 
>> Would it be a good idea for others to follow suit?
>> 
>> Instructions on how to transfer ownership are here: 
>> http://support.it.mtu.edu/Accounts/E-Mail/75946047/How-do-I-transfer-ownership-of-a-Google-Doc.htm
>>  
>> 
>> 
>> 
>> 
>> On Tue, May 29, 2018 at 11:23 AM Lukasz Cwik wrote:
>> I created a PR for the beam-site to link to the design docs and template 
>> from the contribution guide:
>> https://github.com/apache/beam-site/pull/454 
>> 
>> 
>> On Fri, May 25, 2018 at 10:23 AM Lukasz Cwik wrote:
>> Here are some more links related to portability efforts:
>> 
>> https://s.apache.org/beam-fn-api
>> https://s.apache.org/beam-fn-api-processing-a-bundle
>> https://s.apache.org/beam-fn-api-send-and-receive-data
>> https://s.apache.org/beam-fn-state-api-and-bundle-processing
>> https://s.apache.org/beam-fn-api-progress-reporting
>> https://s.apache.org/beam-fn-api-container-contract
>> https://s.apache.org/beam-breaking-fusion
>> https://s.apache.org/beam-runner-api-combine-model
>> https://s.apache.org/beam-fn-api-metrics
>> 
>> On Thu, May 24, 2018 at 2:11 PM Scott Wegner wrote:
>> Thanks for sharing these. I also put together a design doc template based on 
>> common styling / sections I saw in the docs listed above. Others are free to 
>> use it as they'd like.
>> 
>> https://docs.google.com/document/d/1kV

Re: [Call for items] Beam June Newsletter

2018-06-06 Thread Scott Wegner
Thanks for putting this together, Gris. I'm not familiar with the format:
what time period should this newsletter cover? Is this a monthly newsletter
where the June edition covers news from May?

On Tue, Jun 5, 2018 at 4:47 PM Griselda Cuevas wrote:

> Hi Everyone,
>
> Just a reminder to add items to the June Newsletter. The idea behind it is
> to summarize community efforts in the project so others can identify
> similar efforts, opportunities to collaborate, and open questions. Folks on
> the mailing list have found these newsletters useful in the past, so let's
> keep the tradition going :)
>
> I'll extend the deadline until 6/7 11:59 p.m. PST. If you have questions
> let me know.
>
> Thanks!
> G
>
> On Fri, 1 Jun 2018 at 12:19, Griselda Cuevas wrote:
>
>> Hi Everyone,
>>
>> Here's [1] the template for the June Beam Newsletter.
>>
>> *Add the updates you want to share with the community by 6/5 11:59 p.m.*
>> *Pacific Time.*
>>
>> I'll edit and send the final version through the users mailing list on
>> 6/7.
>>
>> Thank you!
>> Gris
>>
>> [1]
>> https://docs.google.com/document/d/1BwRhOu-uDd3SLB_Om_Beke5RoGKos4hj7Ljh7zM2YIo/edit
>>
>


Announcement & Proposal: HDFS tests on large cluster.

2018-06-06 Thread Łukasz Gajowy
Hi all,

I'd like to announce that, thanks to Kamil Szewczyk, since this PR we have
4 file-based HDFS tests running on a "Large HDFS Cluster"! More
specifically, I mean:

- beam_PerformanceTests_TextIOIT_HDFS
- beam_PerformanceTests_Compressed_TextIOIT_HDFS
- beam_PerformanceTests_AvroIOIT_HDFS
- beam_PerformanceTests_XmlIOIT_HDFS

The "Large HDFS Cluster" (in contrast to the small one, that is also
available) consists of a master node and three data nodes all in separate
pods. Thanks to that we can mimic more real-life scenarios on HDFS (3
distributed nodes) and possibly run bigger tests so there's progress! :)

I'm currently working on proper documentation for this so that everyone can
use it in IOITs (stay tuned).

Regarding the above, I'd like to propose scaling up the Kubernetes cluster.
AFAIK it currently consists of 1 node. If we scale it up to e.g. 3 nodes,
the HDFS Kubernetes pods will distribute themselves across different
machines rather than one, making it an even more "real-life" scenario (and
possibly a more efficient one). Moreover, other performance tests (such as
JDBC or MongoDB) could use more space for their infrastructure as well.
Scaling up the cluster could also prove useful for future efforts, like
BEAM-4508 [1] (adapting and running some old IOITs on Jenkins).

WDYT? Are there any objections?

[1] https://issues.apache.org/jira/browse/BEAM-4508
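
For example, if the cluster runs on Google Kubernetes Engine, the resize
would be a one-liner along these lines. The cluster name and zone below are
assumptions, not taken from this thread:

  # resize the (assumed) Kubernetes cluster from 1 node to 3
  gcloud container clusters resize my-perf-cluster --zone=us-central1-a --num-nodes=3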


Re: [VOTE] Apache Beam, version 2.5.0, release candidate #1

2018-06-06 Thread Etienne Chauchot
Thanks JB for all your work! I believe doing the first Gradle release must
have been hard.
I'll run Nexmark on the release and keep you posted.

Best 
Etienne


On Wednesday, June 6, 2018 at 10:44 +0200, Jean-Baptiste Onofré wrote:
> Hi everyone,
> 
> Please review and vote on the release candidate #1 for the version
> 2.5.0, as follows:
> 
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
> 
> NB: this is the first release using Gradle, so don't be too harsh ;) A
> PR about the release guide will follow thanks to this release.
> 
> The complete staging area is available for your review, which includes:
> * JIRA release notes [1],
> * the official Apache source release to be deployed to dist.apache.org
> [2], which is signed with the key with fingerprint C8282E76 [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "v2.5.0-RC1" [5],
> * website pull request listing the release and publishing the API
> reference manual [6].
> * Java artifacts were built with Gradle 4.7 (wrapper) and OpenJDK/Oracle
> JDK 1.8.0_172 (Oracle Corporation 25.172-b11).
> * Python artifacts are deployed along with the source release to the
> dist.apache.org [2].
> 
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
> 
> Thanks,
> JB
> 
> [1]
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12342847
> [2] https://dist.apache.org/repos/dist/dev/beam/2.5.0/
> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> [4] https://repository.apache.org/content/repositories/orgapachebeam-1041/
> [5] https://github.com/apache/beam/tree/v2.5.0-RC1
> [6] https://github.com/apache/beam-site/pull/463
> 

[VOTE] Apache Beam, version 2.5.0, release candidate #1

2018-06-06 Thread Jean-Baptiste Onofré
Hi everyone,

Please review and vote on the release candidate #1 for the version
2.5.0, as follows:

[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

NB: this is the first release using Gradle, so don't be too harsh ;) A
PR about the release guide will follow thanks to this release.

The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release to be deployed to dist.apache.org
[2], which is signed with the key with fingerprint C8282E76 [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "v2.5.0-RC1" [5],
* website pull request listing the release and publishing the API
reference manual [6].
* Java artifacts were built with Gradle 4.7 (wrapper) and OpenJDK/Oracle
JDK 1.8.0_172 (Oracle Corporation 25.172-b11).
* Python artifacts are deployed along with the source release to the
dist.apache.org [2].

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.

Thanks,
JB

[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12342847
[2] https://dist.apache.org/repos/dist/dev/beam/2.5.0/
[3] https://dist.apache.org/repos/dist/release/beam/KEYS
[4] https://repository.apache.org/content/repositories/orgapachebeam-1041/
[5] https://github.com/apache/beam/tree/v2.5.0-RC1
[6] https://github.com/apache/beam-site/pull/463
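
For reviewers, a minimal sketch of one verification step: checking the
source release signature against the KEYS file [3]. The artifact file names
below are assumptions; adjust them to whatever is actually staged in [2]:

  # import the Beam release signing keys
  wget https://dist.apache.org/repos/dist/release/beam/KEYS
  gpg --import KEYS
  # fetch the staged source release and its signature (file names assumed)
  wget https://dist.apache.org/repos/dist/dev/beam/2.5.0/apache-beam-2.5.0-source-release.zip
  wget https://dist.apache.org/repos/dist/dev/beam/2.5.0/apache-beam-2.5.0-source-release.zip.asc
  # verify; the signing key fingerprint should match C8282E76 [3]
  gpg --verify apache-beam-2.5.0-source-release.zip.asc apache-beam-2.5.0-source-release.zip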