Re: Apache Beam Newsletter - September 2018

2018-09-10 Thread Griselda Cuevas
Thank you so much for helping us put this together, Rose!

And thanks to everyone who contributed. It's so good to see how our
contributions grow and how many more events we're representing Beam at!





Apache Beam Newsletter - September 2018

2018-09-10 Thread Rose Nguyen
[image: Beam.png]

September 2018 | Newsletter


What’s been done


CI improvement (by: Etienne Chauchot)

   - For each new commit on master, the Nexmark suite is run in both batch
     and streaming modes on Spark, Flink, and Cloud Dataflow (thanks to
     Andrew), and dashboard graphs are produced to track functional and
     performance regressions.
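
For anyone who wants to reproduce such a run locally, the suite can also be
launched from Gradle. A sketch based on the Nexmark documentation of the
time (module and property names may differ by Beam version):

    ./gradlew :beam-sdks-java-nexmark:run \
        -Pnexmark.runner=":beam-runners-direct-java" \
        -Pnexmark.args="--runner=DirectRunner --streaming=false --suite=SMOKE"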


Elasticsearch IO Supports Version 6 (by: Dat Tran)

   - Elasticsearch IO now supports version 6.x in addition to versions 2.x
     and 5.x.
   - See the merged PR for more details.
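
For readers who want to try the new version support, a minimal Java sketch
of reading from and writing to a cluster (the host, index, and type names
are placeholders):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;

    public class EsRoundTrip {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Placeholder cluster address, index, and type.
        ElasticsearchIO.ConnectionConfiguration conn =
            ElasticsearchIO.ConnectionConfiguration.create(
                new String[] {"http://localhost:9200"}, "my-index", "my-type");

        // Read each document as a JSON string...
        PCollection<String> docs =
            p.apply(ElasticsearchIO.read().withConnectionConfiguration(conn));

        // ...and write the documents back (to another index in a real pipeline).
        docs.apply(ElasticsearchIO.write().withConnectionConfiguration(conn));

        p.run().waitUntilFinish();
      }
    }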


KuduIO Added (by: Tim Robertson)

   - Apache Beam master now has KuduIO, which will be released with Beam
     2.7.0.
   - See BEAM-2661 for more details.



What we’re working on...


Flink Portable Runner (by: Ankur Goenka, Maximilian Michels, Thomas Weise,
Ryan Williams)

   - Support for streaming side inputs merged
   - Portable Compatibility Matrix tests pass in streaming mode
   - Many more ValidatesRunner tests pass (ValidatesRunner is a
     comprehensive suite for Beam test pipelines)
   - Python pipelines can be tested without bringing up a JobServer first
     (it is started in a container)
   - Experimental support for executing the SDK harnesses in a process
     instead of a Docker container
   - Bug fixes to Beam discovered while working on portability

State and Timer Support in Python SDK (by: Charles Chen, Robert Bradshaw)

   - This change adds the reference DirectRunner implementation of the
     Python User State and Timers API. With this change, a user can
     execute DoFns with state and timers on the DirectRunner.
   - See the design doc and PR for more details.
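
The Python API is modeled on the user state and timers support that already
exists in the Java SDK. For a flavor of the model, here is a minimal Java
sketch of a stateful DoFn keeping a running per-key count (class and state
names are illustrative):

    import org.apache.beam.sdk.state.StateSpec;
    import org.apache.beam.sdk.state.StateSpecs;
    import org.apache.beam.sdk.state.ValueState;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;

    // Emits a running count per key, stored in per-key user state.
    public class RunningCountFn extends DoFn<KV<String, Integer>, KV<String, Long>> {
      @StateId("count")
      private final StateSpec<ValueState<Long>> countSpec = StateSpecs.value();

      @ProcessElement
      public void process(ProcessContext c,
                          @StateId("count") ValueState<Long> count) {
        Long current = count.read();
        long updated = (current == null ? 0L : current) + 1;
        count.write(updated);
        c.output(KV.of(c.element().getKey(), updated));
      }
    }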


New IO - HadoopOutputFormatIO (by: Alexey Romanenko)

   - Adding support for the MapReduce OutputFormat.
   - See BEAM-5310 for more details.


High-level Java 8 DSL (by: David Moravek, Vaclav Plajt, Marek Simunek)

   - Adding a high-level Java 8 DSL based on the Euphoria API project
   - See BEAM-3900 for more details.

Performance improvements for HDFS file writing operations (by: Tim
Robertson)

   - Auto-create directories when doing an HDFS rename
   - See the PR for more details


Recognition of non-code contributions (by: Gris Cuevas)

   - Reached consensus on recognizing non-code contributions
   - See the discussion for more details
   - Planned launch date: Beam Summit London (October 2nd)


Weekly Community Updates (by: Gris Cuevas)

   - Some of the project’s subcomponents post weekly updates to the
     mailing list. We’ll be consolidating best practices to share a weekly
     community update with all project-related must-knows in a nutshell.



What’s planned


Beam Cookbook (by: Austin Bennett, David Cavazos, Gris Cuevas, Andrea
Foegler, Rose Nguyen, Connell O'Callaghan, and you!)

   - We are creating a cookbook for common data science tasks in Beam and
     have started brainstorming
   - We want to have a hackathon after the London Summit to generate
     content from the community
   - There will be a session at the summit to gather more ideas and input.
     Watch the dev and user mailing lists for a call for contributions
     soon!


Beam 2.7.0 release (by: Charles Chen)

Beam Mascot (by: Gris Cuevas & Community!)

   - We got approval to launch a contest to create a new Apache Beam
     mascot
   - See the discussion for more details. If you’re interested in driving
     this, reach out in the thread!
   - Planned launch date: last week of September



New Members


New Contributors

   - Đạt Trần, Ho Chi Minh City, Vietnam
      - See BEAM-5107 for more details on “Support ES-6.x for
        ElasticsearchIO”
   - Ravi Pathak, Copenhagen, Denmark
      - Using Beam for indexing open data on species at GBIF.org
      - Improving robustness of SolrIO


New Committers

   - Tim Robertson, Copenhagen, Denmark



Events, Talks & Meetups


[Coming Up] Beam Summit @ London, England

   - Organized by: Matthias Baetens, Victor Kotai, Alex Van Boxel & Gris
     Cuevas
   - The Beam Summit London 2018 will take place on October 1 and 2 in
     London.
   - If you’re interested in speaking, reach out to g...@apache.org
   - More info can be found in the blog post and

Re: PTransforms and Fusion

2018-09-10 Thread Henning Rohde
For A, I am in favor of A1 and A2 as well. It is then up to each SDK not to
generate "empty" transforms in the proto representation, so we avoid the
noise mentioned earlier. The shared Java libraries are also optional and we
should not assume every runner will use them. I'm not convinced empty
transforms would have value for pipeline structure over what can be
accomplished with normal composites. I suspect empty transforms, such as
A, B -> B, B, will just be confusion generators.
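
For readers following along, a minimal Java sketch of the kind of "empty"
composite under discussion: its expand() applies no sub-transforms and
simply returns its input, so the proto representation ends up with a
PTransform that has zero sub-transforms yet is not a known Beam primitive
(the class name is illustrative):

    import org.apache.beam.sdk.transforms.PTransform;
    import org.apache.beam.sdk.values.PCollection;

    // A composite that applies no sub-transforms and returns its input
    // unchanged -- legal at construction time, but ambiguous to a fusion
    // library that treats "no sub-transforms" as "primitive".
    public class Passthrough<T> extends PTransform<PCollection<T>, PCollection<T>> {
      @Override
      public PCollection<T> expand(PCollection<T> input) {
        return input;
      }
    }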

For B, I favor B2 for the reasons Thomas mentions. I also agree with the -1
for B1.

On Mon, Sep 10, 2018 at 2:51 PM Thomas Weise  wrote:

> For B, note the prior discussion [1].
>
> B1 and B2 cannot be supported at the same time.
>
> Native transforms will almost always be customizations. Users do not
> create customizations without reason. They would start with what is there
> and dig deeper only when needed. Right now there are no streaming
> connectors in the Python SDK - should the user not use the SDK? Or is it
> better (now and in general) to have the option of a custom connector, even
> when it is not portable?
>
> Regarding portability, IMO it should be up to the user to decide how much
> of it is necessary/important. The IO requirements are normally dictated by
> the infrastructure. If it has Kafka, Kinesis or any other source (including
> those that Beam might never have a connector for), the user needs to be
> able to integrate it.
>
> Overall extensibility is important and will help users adopt Beam. This
> has come up in a few other areas (think Docker environments). I think we
> need to provide the flexibility and enable, not prevent alternatives and
> therefore -1 for B1 (unsurprisingly :).
>
> [1]
> https://lists.apache.org/thread.html/9813ee10cb1cd9bf64e1c4f04c02b606c7b12d733f4505fb62f4a954@%3Cdev.beam.apache.org%3E
>
>
> On Mon, Sep 10, 2018 at 10:14 AM Robert Bradshaw 
> wrote:
>
>> A) I think it's a bug to not handle empty PTransforms (which are useful
>> at pipeline construction, and may still have meaning in terms of pipeline
>> structure, e.g. for visualization). Note that such transforms, if truly
>> composite, can't output any PCollections that do not appear in their inputs
>> (which is how we distinguish them from primitives in Python). Thus I'm in
>> favor of A3, and as a stopgap we can drop these transforms as part of/just
>> before decoding in the Java libraries (rather than in the SDKs during
>> encoding as in A2).
>>
>> B) I'm also for B1 or B2.
>>
>>
>> On Mon, Sep 10, 2018 at 3:58 PM Maximilian Michels 
>> wrote:
>>
>>> > A) What should we do with these "empty" PTransforms?
>>>
>>> We can't translate them, so dropping them seems the most reasonable
>>> choice. Should we throw an error/warning to make the user aware of this?
>>> Otherwise might be unexpected for the user.
>>>
>>> >> A3) Handle the "empty" PTransform case within all of the shared
>>> libraries.
>>>
>>> What can we do at this point other than dropping them?
>>>
>>> > B) What should we do with "native" PTransforms?
>>>
>>> I support B1 and B2 as well. Non-portable PTransforms should be
>>> discouraged in the long run. However, the available PTransforms are not
>>> even consistent across the different SDKs yet (e.g. no streaming
>>> connectors in Python), so we should continue to provide a way to utilize
>>> the "native" transforms of a Runner.
>>>
>>> -Max
>>>
>>> On 07.09.18 19:15, Lukasz Cwik wrote:
>>> > A primitive transform is a PTransform that has been chosen to have no
>>> > default implementation in terms of other PTransforms. A primitive
>>> > transform therefore must be implemented directly by a pipeline runner
>>> in
>>> > terms of pipeline-runner-specific concepts. An initial list of
>>> primitive
>>> > PTransforms were defined in [2] and has since been updated in [3].
>>> >
>>> > As part of the portability effort, libraries that are intended to be
>>> > shared across multiple runners are being developed to support their
>>> > migration to a portable execution model. One of these is responsible
>>> for
>>> > fusing multiple primitive PTransforms together into a pipeline runner
>>> > specific concept. This library made the choice that a primitive
>>> > PTransform is a PTransform that doesn't contain any other PTransforms.
>>> >
>>> > Unfortunately, while Ryan was attempting to enable testing of
>>> validates
>>> > runner tests for Flink using the new portability libraries, he ran
>>> into
>>> > an issue where the Apache Beam Java SDK allows for a person to
>>> construct
>>> > a PTransform that has zero sub PTransforms and also isn't one of the
>>> > defined Apache Beam primitives. In this case the PTransform was
>>> trivial
>>> > as it was not applying any additional transforms to input PCollection
>>> > and just returning it. This caused an issue within the portability
>>> > libraries since they couldn't handle this structure.
>>> >
>>> > To solve this issue, I had proposed that we modify the portability
>>> > library that does 

Re: [Proposal] Creating a reproducible environment for Beam Jenkins Tests

2018-09-10 Thread Henning Rohde
+1 Nice proposal. It will help eradicate some of the inflexibility and
frustrations with Jenkins.

On Wed, Sep 5, 2018 at 2:30 PM Yifan Zou  wrote:

> Thank you all for making comments on this and I apologize for the late
> reply.
>
> To clarify the concerns about testing locally, it is still possible to run
> tests without Docker. One of the purposes of this is to create an
> environment identical to the one we run in Jenkins, which would be helpful
> for reproducing strange errors. Contributors could choose to start a
> container and run tests in there, or just run tests directly.
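
A sketch of the container-based option under this proposal (the image name
is hypothetical; the real image and setup are described in the design doc):

    # Start a container mirroring the Jenkins environment, mount the local
    # Beam checkout, and run the same Gradle commands inside it:
    docker run -it --rm -v "$PWD":/workspace -w /workspace beam-jenkins-env \
        ./gradlew javaPreCommit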
>
>
>
> On Wed, Sep 5, 2018 at 6:37 AM Ismaël Mejía  wrote:
>
>> BIG +1, the previous work on having docker build images [1] had a
>> similar goal (to have a reproducible build environment). But this is
>> even better because we will guarantee the exact same environment in
>> Jenkins as well as any further improvements. It is important to
>> document the setup process as part of this (for future maintenance +
>> local reproducibility).
>>
>> Just for clarification, this is independent of running the tests
>> locally without Docker; it is more about reproducing the environment
>> we have on Jenkins locally, for example to address some weird
>> Heisenbug.
>>
>> I just added BEAM-5311 to track the removal of the docker build images
>> when this is ready (of course if there are no objections to this
>> proposal).
>>
>> [1] https://beam.apache.org/contribute/docker-images/
>> On Thu, Aug 30, 2018 at 3:58 PM Jean-Baptiste Onofré 
>> wrote:
>> >
>> > Hi,
>> >
>> > That's interesting, however, it's really important to still be able to
>> > easily run test locally, without any VM/Docker required. It should be
>> > activated by profile or so.
>> >
>> > Regards
>> > JB
>> >
>> > On 27/08/2018 19:53, Yifan Zou wrote:
>> > > Hi,
>> > >
>> > > I have a proposal for creating a reproducible environment for Jenkins
>> > > tests by using docker container. The thing is, the environment
>> > > configurations on Beam Jenkins slaves are sometimes different from
>> > > developer's machines. Test failures on Jenkins may not be easy to
>> > > reproduce locally. Also, it is not convenient for developers to add or
>> > > modify underlying tools installed on Jenkins VMs, since they're
>> managed
>> > > by Apache Infra. This proposal is aimed to address those problems.
>> > >
>> > >
>> https://docs.google.com/document/d/1y0YuQj_oZXC0uM5-gniG7r9-5gv2uiDhzbtgYYJW48c/edit#heading=h.bg2yi0wbhl9n
>> > >
>> > > Any comments are welcome. Thank you.
>> > >
>> > > Regards.
>> > > Yifan
>> > >
>> >
>> > --
>> > Jean-Baptiste Onofré
>> > jbono...@apache.org
>> > http://blog.nanthrax.net
>> > Talend - http://www.talend.com
>>
>


Re: New Post Commit Task fails in SetupVirtualEnv when running on Jenkins

2018-09-10 Thread Valentyn Tymofieiev
Thanks for sharing. This may be caused by
https://github.com/tox-dev/tox/issues/649.

On Fri, Sep 7, 2018 at 4:48 PM Ankur Goenka  wrote:

> It seems that the issue was with the length of the file name, which was
>
> /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_PortableValidatesRunner_Flink_Gradle_PR/src/sdks/python/build/gradleenv/bin/python2
>
> Changing the task name to beam_PostCommit_Python_PVR_Flink_Gradle_PR fixed
> the issue.
>
> On Thu, Sep 6, 2018 at 6:37 PM Ankur Goenka  wrote:
>
>> Hi,
>>
>> I added a new Post Commit task
>> "beam_PostCommit_Python_PortableValidatesRunner_Flink_Gradle" to Jenkins in
>> https://github.com/apache/beam/pull/6322, but the Jenkins job is failing
>> with the logs below.
>> However, the Gradle task works fine when I execute it on my local machine
>> using ./gradlew :beam-sdks-python:flinkCompatibilityMatrixBatch. I tried to
>> trim down the Jenkins job to just have setupVirtualEnv and that also fails
>> with the same exception.
>> Here is an instance of failing run
>> https://builds.apache.org/job/beam_PostCommit_Python_PortableValidatesRunner_Flink_Gradle_PR/14/console#gradle-task-34
>> and a corresponding run for beam_PostCommit_Python_Verify which succeeded
>> in setupVirtualEnv
>> https://builds.apache.org/job/beam_PostCommit_Python_Verify_PR/60/console
>>
>> Any pointers in resolving this will be greatly appreciated.
>>
>> Suspicious logs
>> 17:41:10 Task ':beam-sdks-python:setupVirtualenv' is not up-to-date
>> because:
>> 17:41:10   No history is available.
>> 17:41:10 Starting process 'command 'virtualenv''. Working directory:
>> /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_PortableValidatesRunner_Flink_Gradle_PR/src/sdks/python
>> Command: virtualenv
>> /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_PortableValidatesRunner_Flink_Gradle_PR/src/sdks/python/build/gradleenv
>> 17:41:10 Successfully started process 'command 'virtualenv''
>> 17:41:10 New python executable in
>> /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_PortableValidatesRunner_Flink_Gradle_PR/src/sdks/python/build/gradleenv/bin/python2
>> 17:41:10 Also creating executable in
>> /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_PortableValidatesRunner_Flink_Gradle_PR/src/sdks/python/build/gradleenv/bin/python
>> 17:41:11 Installing setuptools, pkg_resources, pip, wheel...done.
>> 17:41:11 Running virtualenv with interpreter /usr/bin/python2
>> 17:41:11 Starting process 'command 'sh''. Working directory:
>> /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_PortableValidatesRunner_Flink_Gradle_PR/src/sdks/python
>> Command: sh -c .
>> /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_PortableValidatesRunner_Flink_Gradle_PR/src/sdks/python/build/gradleenv/bin/activate
>> && pip install --upgrade tox==3.0.0 grpcio-tools==1.3.5
>> 17:41:11 Successfully started process 'command 'sh''
>> 17:41:11 Collecting tox==3.0.0
>> 17:41:11   Using cached
>> https://files.pythonhosted.org/packages/e6/41/4dcfd713282bf3213b0384320fa8841e4db032ddcb80bc08a540159d42a8/tox-3.0.0-py2.py3-none-any.whl
>> 17:41:11 Collecting grpcio-tools==1.3.5
>> 17:41:12   Using cached
>> https://files.pythonhosted.org/packages/05/f6/0296e29b1bac6f85d2a8556d48adf825307f73109a3c2c17fb734292db0a/grpcio_tools-1.3.5-cp27-cp27mu-manylinux1_x86_64.whl
>> 17:41:12 Collecting pluggy<1.0,>=0.3.0 (from tox==3.0.0)
>> 17:41:12   Using cached
>> https://files.pythonhosted.org/packages/f5/f1/5a93c118663896d83f7bcbfb7f657ce1d0c0d617e6b4a443a53abcc658ca/pluggy-0.7.1-py2.py3-none-any.whl
>> 17:41:12 Requirement not upgraded as not directly required: six in
>> /usr/local/lib/python2.7/dist-packages (from tox==3.0.0) (1.11.0)
>> 17:41:12 Requirement not upgraded as not directly required:
>> virtualenv>=1.11.2 in /usr/lib/python2.7/dist-packages (from tox==3.0.0)
>> (15.0.1)
>> 17:41:12 Collecting py>=1.4.17 (from tox==3.0.0)
>> 17:41:12   Using cached
>> https://files.pythonhosted.org/packages/c8/47/d179b80ab1dc1bfd46a0c87e391be47e6c7ef5831a9c138c5c49d1756288/py-1.6.0-py2.py3-none-any.whl
>> 17:41:12 Requirement not upgraded as not directly required: grpcio>=1.3.5
>> in /home/jenkins/.local/lib/python2.7/site-packages (from
>> grpcio-tools==1.3.5) (1.13.0)
>> 17:41:12 Requirement not upgraded as not directly required:
>> protobuf>=3.2.0 in /home/jenkins/.local/lib/python2.7/site-packages (from
>> grpcio-tools==1.3.5) (3.6.0)
>> 17:41:12 Requirement not upgraded as not directly required: enum34>=1.0.4
>> in /usr/local/lib/python2.7/dist-packages (from
>> grpcio>=1.3.5->grpcio-tools==1.3.5) (1.1.6)
>> 17:41:12 Requirement not upgraded as not directly required:
>> futures>=2.2.0 in /home/jenkins/.local/lib/python2.7/site-packages (from
>> grpcio>=1.3.5->grpcio-tools==1.3.5) (3.2.0)
>> 17:41:12 Requirement not upgraded as not directly required: setuptools in
>> /home/jenkins/.local/lib/python2.7/site-packages (from
>> protobuf>=3.2.0->grpcio-tools==1.3.5) (40.0.0)
>> 17:41:13 

Re: PTransforms and Fusion

2018-09-10 Thread Thomas Weise
For B, note the prior discussion [1].

B1 and B2 cannot be supported at the same time.

Native transforms will almost always be customizations. Users do not create
customizations without reason. They would start with what is there and dig
deeper only when needed. Right now there are no streaming connectors in the
Python SDK - should the user not use the SDK? Or is it better (now and in
general) to have the option of a custom connector, even when it is not
portable?

Regarding portability, IMO it should be up to the user to decide how much
of it is necessary/important. The IO requirements are normally dictated by
the infrastructure. If it has Kafka, Kinesis or any other source (including
those that Beam might never have a connector for), the user needs to be
able to integrate it.

Overall extensibility is important and will help users adopt Beam. This has
come up in a few other areas (think Docker environments). I think we need
to provide the flexibility and enable, not prevent alternatives and
therefore -1 for B1 (unsurprisingly :).

[1]
https://lists.apache.org/thread.html/9813ee10cb1cd9bf64e1c4f04c02b606c7b12d733f4505fb62f4a954@%3Cdev.beam.apache.org%3E


On Mon, Sep 10, 2018 at 10:14 AM Robert Bradshaw 
wrote:

> A) I think it's a bug to not handle empty PTransforms (which are useful at
> pipeline construction, and may still have meaning in terms of pipeline
> structure, e.g. for visualization). Note that such transforms, if truly
> composite, can't output any PCollections that do not appear in their inputs
> (which is how we distinguish them from primitives in Python). Thus I'm in
> favor of A3, and as a stopgap we can drop these transforms as part of/just
> before decoding in the Java libraries (rather than in the SDKs during
> encoding as in A2).
>
> B) I'm also for B1 or B2.
>
>
> On Mon, Sep 10, 2018 at 3:58 PM Maximilian Michels  wrote:
>
>> > A) What should we do with these "empty" PTransforms?
>>
>> We can't translate them, so dropping them seems the most reasonable
>> choice. Should we throw an error/warning to make the user aware of this?
>> Otherwise might be unexpected for the user.
>>
>> >> A3) Handle the "empty" PTransform case within all of the shared
>> libraries.
>>
>> What can we do at this point other than dropping them?
>>
>> > B) What should we do with "native" PTransforms?
>>
>> I support B1 and B2 as well. Non-portable PTransforms should be
>> discouraged in the long run. However, the available PTransforms are not
>> even consistent across the different SDKs yet (e.g. no streaming
>> connectors in Python), so we should continue to provide a way to utilize
>> the "native" transforms of a Runner.
>>
>> -Max
>>
>> On 07.09.18 19:15, Lukasz Cwik wrote:
>> > A primitive transform is a PTransform that has been chosen to have no
>> > default implementation in terms of other PTransforms. A primitive
>> > transform therefore must be implemented directly by a pipeline runner
>> in
>> > terms of pipeline-runner-specific concepts. An initial list of
>> primitive
>> > PTransforms were defined in [2] and has since been updated in [3].
>> >
>> > As part of the portability effort, libraries that are intended to be
>> > shared across multiple runners are being developed to support their
>> > migration to a portable execution model. One of these is responsible
>> for
>> > fusing multiple primitive PTransforms together into a pipeline runner
>> > specific concept. This library made the choice that a primitive
>> > PTransform is a PTransform that doesn't contain any other PTransforms.
>> >
>> > Unfortunately, while Ryan was attempting to enable testing of validates
>> > runner tests for Flink using the new portability libraries, he ran into
>> > an issue where the Apache Beam Java SDK allows for a person to
>> construct
>> > a PTransform that has zero sub PTransforms and also isn't one of the
>> > defined Apache Beam primitives. In this case the PTransform was trivial
>> > as it was not applying any additional transforms to input PCollection
>> > and just returning it. This caused an issue within the portability
>> > libraries since they couldn't handle this structure.
>> >
>> > To solve this issue, I had proposed that we modify the portability
>> > library that does fusion to use a whitelist of primitives preventing
>> the
>> > issue from happening. This solved the problem but caused an issue for
>> > Thomas as he was relying on this behaviour of PTransforms with zero sub
>> > transforms being primitives. Thomas has a use-case where he wants to
>> > expose the internal Flink Kafka and Kinesis connectors and to build
>> > Apache Beam pipelines that use the Flink native sources/sinks. I'll
>> call
>> > these "native" PTransforms, since they aren't part of the Apache Beam
>> > model and are runner specific.
>> >
>> > This brings up two topics:
>> > A) What should we do with these "empty" PTransforms?
>> > B) What should we do with "native" PTransforms?
>> >
>> > The typical flow of a pipeline 

Re: Gradle Races in beam-examples-java, beam-runners-apex

2018-09-10 Thread Lukasz Cwik
I had originally suggested using some Linux kernel tooling such as
inotifywait[1] to watch what is happening.

It is likely that we have some Gradle task which is running something in
parallel to a different Gradle task when it shouldn't which means that the
jar file is being changed/corrupted. I believe fixing our Gradle task
dependency tree wrt to this would solve the problem. This crash does not
reproduce on my desktop after 20 runs which makes it hard for me to test
for.

1: https://www.linuxjournal.com/content/linux-filesystem-events-inotify
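
Two concrete handles mentioned in this thread, for anyone investigating
locally (the watched path is illustrative):

    # Watch build outputs for concurrent writes while a parallel build runs:
    inotifywait -m -r -e modify,close_write build/

    # Or sidestep the suspected inter-task race by building serially:
    ./gradlew javaPreCommit --no-parallel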

On Mon, Sep 10, 2018 at 1:15 PM Ryan Williams  wrote:

> this continues to be an issue locally (cf. some discussion in #beam slack)
>
> commands like `./gradlew javaPreCommit` or `./gradlew build` reliably fail
> with a range of different JVM crashes in a few different tasks, with
> messages that suggest filing a bug against the Java compiler
>
> what do we know about the actual race condition that is allowing one task
> to attempt to read from a JAR that is being overwritten by another task?
> presumably this is just a bug in our Gradle configs?
>
> On Mon, Aug 27, 2018 at 2:28 PM Andrew Pilloud 
> wrote:
>
>> It appears that there is no one working on a fix for the flakes, so I've
>> merged the change to disable parallel tasks on precommit.
>>
>> Andrew
>>
>> On Fri, Aug 24, 2018 at 1:30 PM Andrew Pilloud 
>> wrote:
>>
>>> I'm seeing failures due to this on 12 of the last 16 PostCommits.
>>> Precommits take about 22 minutes run in parallel, so at a 25% pass rate
>>> that puts the expected time to a good test run at 264 minutes assuming you
>>> immediately restart on each failure. We are looking at 56 minutes for a
>>> precommit that isn't run in parallel:
>>> https://builds.apache.org/job/beam_PreCommit_Java_Phrase/266/ I'd
>>> rather have tests take a little longer than have to monitor them for
>>> several hours.
>>>
>>> I've opened a PR: https://github.com/apache/beam/pull/6274
>>>
>>> Andrew
>>>
>>> On Fri, Aug 24, 2018 at 10:47 AM Lukasz Cwik  wrote:
>>>
 I believe it would mitigate the issue but also make the jobs take much
 longer to complete.

 On Thu, Aug 23, 2018 at 2:44 PM Andrew Pilloud 
 wrote:

> There seems to be a misconfiguration of gradle that is causing a high
> rate of failure for the last several weeks in building beam-examples-java
> and beam-runners-apex. It appears to be some sort of race condition in
> building dependencies. Given that no one has made progress on fixing the
> root cause, is this something we could mitigate by running jobs with
> `--no-parallel` flag?
>
> https://issues.apache.org/jira/browse/BEAM-5035
> https://issues.apache.org/jira/browse/BEAM-5207
>
> Andrew
>



Re: Gradle Races in beam-examples-java, beam-runners-apex

2018-09-10 Thread Ryan Williams
this continues to be an issue locally (cf. some discussion in #beam slack)

commands like `./gradlew javaPreCommit` or `./gradlew build` reliably fail
with a range of different JVM crashes in a few different tasks, with
messages that suggest filing a bug against the Java compiler

what do we know about the actual race condition that is allowing one task
to attempt to read from a JAR that is being overwritten by another task?
presumably this is just a bug in our Gradle configs?

On Mon, Aug 27, 2018 at 2:28 PM Andrew Pilloud  wrote:

> It appears that there is no one working on a fix for the flakes, so I've
> merged the change to disable parallel tasks on precommit.
>
> Andrew
>
> On Fri, Aug 24, 2018 at 1:30 PM Andrew Pilloud 
> wrote:
>
>> I'm seeing failures due to this on 12 of the last 16 PostCommits.
>> Precommits take about 22 minutes run in parallel, so at a 25% pass rate
>> that puts the expected time to a good test run at 264 minutes assuming you
>> immediately restart on each failure. We are looking at 56 minutes for a
>> precommit that isn't run in parallel:
>> https://builds.apache.org/job/beam_PreCommit_Java_Phrase/266/ I'd rather
>> have tests take a little longer than have to monitor them for several hours.
>>
>> I've opened a PR: https://github.com/apache/beam/pull/6274
>>
>> Andrew
>>
>> On Fri, Aug 24, 2018 at 10:47 AM Lukasz Cwik  wrote:
>>
>>> I believe it would mitigate the issue but also make the jobs take much
>>> longer to complete.
>>>
>>> On Thu, Aug 23, 2018 at 2:44 PM Andrew Pilloud 
>>> wrote:
>>>
 There seems to be a misconfiguration of gradle that is causing a high
 rate of failure for the last several weeks in building beam-examples-java
 and beam-runners-apex. It appears to be some sort of race condition in
 building dependencies. Given that no one has made progress on fixing the
 root cause, is this something we could mitigate by running jobs with
 `--no-parallel` flag?

 https://issues.apache.org/jira/browse/BEAM-5035
 https://issues.apache.org/jira/browse/BEAM-5207

 Andrew

>>>


Re: [VOTE] Release 2.7.0, release candidate #1

2018-09-10 Thread Lukasz Cwik
I found an issue where we are no longer packaging the pom.xml within the
artifact jars at META-INF/maven/groupId/artifactId. More details in
https://issues.apache.org/jira/browse/BEAM-5351. I wouldn't consider this a
blocker but it was an easy fix (https://github.com/apache/beam/pull/6358)
and users may rely on the pom.xml.

Should we recut the release candidate to include this?

On Mon, Sep 10, 2018 at 4:58 AM Jean-Baptiste Onofré 
wrote:

> +1 (binding)
>
> Tested successfully on Beam Samples.
>
> Thanks !
>
> Regards
> JB
>
> On 07/09/2018 23:56, Charles Chen wrote:
> > Hi everyone,
> >
> > Please review and vote on the release candidate #1 for the version
> > 2.7.0, as follows:
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> > The complete staging area is available for your review, which includes:
> > * JIRA release notes [1],
> > * the official Apache source release to be deployed to dist.apache.org
> >  [2], which is signed with the key with
> > fingerprint 45C60AAAD115F560 [3],
> > * all artifacts to be deployed to the Maven Central Repository [4],
> > * source code tag "v2.7.0-RC1" [5],
> > * website pull request listing the release and publishing the API
> > reference manual [6].
> > * Java artifacts were built with Gradle 4.8 and OpenJDK
> > 1.8.0_181-8u181-b13-1~deb9u1-b13.
> > * Python artifacts are deployed along with the source release to the
> > dist.apache.org  [2].
> >
> > The vote will be open for at least 72 hours. It is adopted by majority
> > approval, with at least 3 PMC affirmative votes.
> >
> > Thanks,
> > Charles
> >
> > [1]
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12343654
> > [2] https://dist.apache.org/repos/dist/dev/beam/2.7.0
> > [3] https://dist.apache.org/repos/dist/dev/beam/KEYS
> > [4]
> https://repository.apache.org/content/repositories/orgapachebeam-1046/
> > [5] https://github.com/apache/beam/tree/v2.7.0-RC1
> > [6] https://github.com/apache/beam-site/pull/549
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [portablility] metrics interrogations

2018-09-10 Thread Lukasz Cwik
Alex is out on vacation for the next 3 weeks.

Alex had proposed the types of metrics[1] but not the exact protocol as to
what the SDK and runner do. I could envision Alex proposing that the SDK
harness only sends diffs or dirty metrics in intermediate updates and all
metrics values in the final update.
Robert is referring to an integration that happened with an older set of
messages[2] that preceded Alex's proposal; that integration with Dataflow,
which is still incomplete, works as you described in #2.

Robin had recently been considering adding an accessor to DoFns that would
allow you to get access to the job information from within the pipeline
(current state, poll for metrics, invoke actions like cancel / drain, ...).
He wanted it so he could poll for attempted metrics to be able to test
@RequiresStableInput. Integrating the MetricsPusher or something like that
on the SDK side to be able to poll metrics over the job information
accessor could be useful.

1: https://s.apache.org/beam-fn-api-metrics
2:
https://github.com/apache/beam/blob/9b68f926628d727e917b6a33ccdafcfe693eef6a/model/fn-execution/src/main/proto/beam_fn_api.proto#L410
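
For context, the metrics in question are the ones users define inside their
UDFs. A minimal Java sketch (class and metric names are illustrative):

    import org.apache.beam.sdk.metrics.Counter;
    import org.apache.beam.sdk.metrics.Metrics;
    import org.apache.beam.sdk.transforms.DoFn;

    // A user-defined counter: the SDK harness aggregates it locally within
    // a bundle and reports the result to the runner over the Fn API.
    public class CountingFn extends DoFn<String, String> {
      private final Counter elements = Metrics.counter(CountingFn.class, "elements");

      @ProcessElement
      public void process(ProcessContext c) {
        elements.inc();
        c.output(c.element());
      }
    }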



On Mon, Sep 10, 2018 at 8:41 AM Robert Burke  wrote:

> The way I entered them into the Go SDK is #2 (SDK sends diffs per bundle)
> and the Java Runner Harness appears to aggregate them correctly from there.
>
> On Mon, Sep 10, 2018, 2:07 AM Etienne Chauchot 
> wrote:
>
>> Hi all,
>>
>> @Luke, @Alex I have a general question related to metrics in the Fn API,
>> as the communication between the runner harness and the SDK harness is
>> done on a bundle basis. When the runner harness sends data to the SDK
>> harness to execute a transform that contains metrics, does it:
>>
>>    1. send metrics values (for the ones defined in the transform)
>>    alongside the data, and receive an updated value of the metrics from
>>    the SDK harness when the bundle is finished processing?
>>    2. or does it send only the data, and the SDK harness responds with a
>>    diff value of the metrics so that the runner can update them on its
>>    side?
>>
>> My bet is option 2. But can you confirm?
>>
>> Thanks
>>
>> Etienne
>>
>> On Thursday, 19 July 2018 at 15:10 +0200, Etienne Chauchot wrote:
>>
>> Thanks for the confirmations Luke.
>>
>> On Wednesday, 18 July 2018 at 07:56 -0700, Lukasz Cwik wrote:
>>
>>
>>
>> On Wed, Jul 18, 2018 at 7:01 AM Etienne Chauchot 
>> wrote:
>>
>> Hi,
>> Luke, Alex, I have some portable metrics interrogations, can you confirm
>> them ?
>>
>> 1 - As it is the SDK harness that will run the code of the UDFs, if a UDF
>> defines a metric, then the SDK harness will give updates through GRPC calls
>> to the runner so that the runner could update metrics cells, right?
>>
>>
>> Yes.
>>
>>
>>
>> 2 - Alex, you mentioned in the proto and design doc that there will be no
>> aggregation of metrics. But some runners (Spark/Flink) rely on
>> accumulators, and when they are merged, it triggers the merging of the
>> whole chain to the metric cells. I know that Dataflow does not do the
>> same; it uses non-aggregated metrics and sends them to an aggregation
>> service. Will there be a change of paradigm with portability for runners
>> that merge themselves?
>>
>>
>> There will be local aggregation of metrics scoped to a bundle; after the
>> bundle is finished processing they are discarded. This will require some
>> kind of global aggregation support from a runner, whether that runner does
>> it via accumulators or via an aggregation service is up to the runner.
>>
>> 3 - Please confirm that the distinction between attempted and committed
>> metrics is not the business of portable metrics. Indeed, it does not
>> involve communication between the runner harness and the SDK harness as it
>> is a runner only matter. I mean, when a runner commits a bundle it just
>> updates its committed metrics and do not need to inform the SDK harness.
>> But, of course, when the user requests committed metrics through the SDK,
>> then the SDK harness will ask the runner harness to give them.
>>
>>
>>
>> You are correct in saying that during execution, the SDK does not
>> differentiate between attempted and committed metrics and only the runner
>> does. We still lack an API definition and contract for how an SDK would
>> query for metrics from a runner, but you're right in saying that an SDK
>> could request committed metrics and the runner would supply them somehow.
>>
>>
>> Thanks
>>
>> Best
>> Etienne
>>
>>
>>
>>


Re: [FYI] Paper of Building Beam Runner for IBM Streams

2018-09-10 Thread Tim
Thanks for sharing Manu - interesting paper indeed.

Tim

> On 10 Sep 2018, at 16:02, Maximilian Michels  wrote:
> 
> Excellent write-up. Thank you!
> 
>> On 09.09.18 20:43, Jean-Baptiste Onofré wrote:
>> Good idea. It could also help people who wants to create runners.
>> Regards
>> JB
>>> On 09/09/2018 13:00, Manu Zhang wrote:
>>> Hi all,
>>> 
>>> I've spent the weekend reading Challenges and Experiences in Building an
>>> Efficient Apache Beam Runner For IBM Streams
>>> from the August
>>> proceedings of PVLDB. It's quite enjoyable and urges me to reflect on how
>>> I (should've) implemented the Gearpump runner. I believe it will be
>>> beneficial to have more such papers and discussions sharing design
>>> choices and lessons from various runners.
>>> 
>>> Enjoy it !
>>> Manu Zhang


Re: [PROPOSAL] Test performance of basic Apache Beam operations

2018-09-10 Thread Łukasz Gajowy
In my opinion, and as far as I understand Nexmark, there are benefits to
having both types of tests. The load tests we propose can be very
straightforward and clearly show what is being tested, thanks to the fact
that there's no fixed model, only very "low level" KV collections. They
are more flexible in the shapes of the pipelines they can express, e.g.
fanout_64 (see the sketch below), without having to think about specific
use cases.
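
To make the shape concrete, a minimal Java sketch of the kind of fanout
step such a load test could use (the class name and fanout factor are
illustrative):

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;

    // Emits each input element N times, producing an N-fold larger
    // downstream collection (N = 64 for a "fanout_64" scenario).
    public class FanoutFn extends DoFn<KV<byte[], byte[]>, KV<byte[], byte[]>> {
      private final int fanout;

      public FanoutFn(int fanout) {
        this.fanout = fanout;
      }

      @ProcessElement
      public void process(ProcessContext c) {
        for (int i = 0; i < fanout; i++) {
          c.output(c.element());
        }
      }
    }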

Having both types would allow developers to decide whether they want to
create a new Nexmark query for their specific case or develop a new load
test (whatever is easier and fits their case better). However, there is a
risk: with raw KV collections, a developer can overemphasize cases that
never happen in practice, so we need to be careful about the exact
configurations we run.

Still, I can imagine there will be code common to both types of tests, and
we will seek ways to avoid duplicating it.

WDYT?

Regards,
Łukasz



On Mon, 10 Sep 2018 at 16:36, Etienne Chauchot wrote:

> Hi,
> It seems that there is a notable overlap with what Nexmark already does:
> Nexmark measures performance and regressions by exercising the whole Beam
> model in both batch and streaming modes with several runners. It also
> computes on synthetic data. Also, Nexmark is already included as PostCommits
> in the CI and dashboards.
>
> Shall we merge the two?
>
> Best
>
> Etienne
>
> On Monday, 10 September 2018 at 12:56 +0200, Łukasz Gajowy wrote:
>
> Hello everyone,
>
> thank you for all your comments to the proposal. To sum up:
>
> A set of performance tests exercising Core Beam Transforms (ParDo,
> GroupByKey, CoGroupByKey, Combine) will be implemented for Java and Python
> SDKs. Those tests will make it possible to:
>
>    - measure performance of the transforms on various runners
>    - exercise the transforms by creating stressful conditions and big
>      loads using Synthetic Source and Synthetic Step API (delays, keeping
>      the CPU busy or asleep, processing large keys and values, performing
>      fanout or reiteration of inputs)
>    - run both in batch and streaming context
>    - gather various metrics
>    - notice regressions by comparing data from consecutive Jenkins runs
>
> Metrics (runtime, consumed bytes, memory usage, split/bundle count) can be
> gathered during test invocations. We will start with runtime and leverage
> Metrics API to collect the other metrics in later phases of development.
> The tests will be fully configurable through pipeline options and it will
> be possible to run any custom scenarios manually. However, a representative
> set of testing scenarios will be run periodically using Jenkins.
>
> Regards,
> Łukasz
>
> On Wed, 5 Sep 2018 at 20:31, Rafael Fernandez wrote:
>
> neat! left a comment or two
>
> On Mon, Sep 3, 2018 at 3:53 AM Łukasz Gajowy  wrote:
>
> Hi all!
>
> I'm bumping this (in case you missed it). Any feedback and questions are
> welcome!
>
> Best regards,
> Łukasz
>
> On Mon, 13 Aug 2018 at 13:51, Jean-Baptiste Onofré wrote:
>
> Hi Lukasz,
>
> Thanks for the update, and the abstract looks promising.
>
> Let me take a look on the doc.
>
> Regards
> JB
>
> On 13/08/2018 13:24, Łukasz Gajowy wrote:
> > Hi all,
> >
> > since Synthetic Sources API has been introduced in Java and Python SDK,
> > it can be used to test some basic Apache Beam operations (i.e.
> > GroupByKey, CoGroupByKey Combine, ParDo and ParDo with SideInput) in
> > terms of performance. This, in brief, is why we'd like to share the
> > below proposal:
> >
> > _
> https://docs.google.com/document/d/1PuIQv4v06eosKKwT76u7S6IP88AnXhTf870Rcj1AHt4/edit?usp=sharing_
> >
> > Let us know what you think in the document's comments. Thank you in
> > advance for all the feedback!
> >
> > Łukasz
>
>


Re: [PROPOSAL] Beam jira bot exceptions

2018-09-10 Thread Yifan Zou
Thanks for raising this problem. There was a discussion about those issues
and we came up with new policies for the Jira bot. At this point, we could
close the JIRAs so that the bot won't create duplicate issues with the same
version.
https://lists.apache.org/list.html?dev@beam.apache.org:lte=1M:%5BDISCUSS%5D%20Versioning%2C%20Hadoop%20related%20dependencies%20and%20enterprise%20users


Regards.
Yifan

On Mon, Sep 10, 2018 at 6:14 AM Jean-Baptiste Onofré 
wrote:

> +1
>
> It would be great to be able to configure "known issues" ;)
>
> Regards
> JB
>
> On 10/09/2018 15:07, Etienne Chauchot wrote:
> > Hi,
> >
> > The bot for dependency checks opens tickets automatically when it sees
> > stale dependencies. But in some cases it is perfectly normal.
> > For example, in elasticsearchIO we have the core module (used by the
> > users) that is compatible with all the versions (for backward
> > compatibility and ease of use). But there is also one test module per
> > supported version (v2, v5, v6) that runs tests against an embedded
> > version of ES. These modules contain ES deps in v2, v5 and v6.
> > In such a case the bot should not open a ticket. Shall we configure it
> > with exceptions ?
> >
> > Note that for now I closed the related tickets.
> >
> > Best
> > Etienne
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [portablility] metrics interrogations

2018-09-10 Thread Robert Burke
The way I entered them into the Go SDK is #2 (SDK sends diffs per bundle)
and the Java Runner Harness appears to aggregate them correctly from there.

On Mon, Sep 10, 2018, 2:07 AM Etienne Chauchot  wrote:

> Hi all,
>
> @Luke, @Alex I have a general question related to metrics in the Fn API,
> as the communication between the runner harness and the SDK harness is
> done on a bundle basis. When the runner harness sends data to the SDK
> harness to execute a transform that contains metrics, does it:
>
>    1. send metrics values (for the ones defined in the transform)
>    alongside the data, and receive an updated value of the metrics from
>    the SDK harness when the bundle is finished processing?
>    2. or does it send only the data, and the SDK harness responds with a
>    diff value of the metrics so that the runner can update them on its
>    side?
>
> My bet is option 2. But can you confirm?
>
> Thanks
>
> Etienne
>
> On Thursday, 19 July 2018 at 15:10 +0200, Etienne Chauchot wrote:
>
> Thanks for the confirmations Luke.
>
> On Wednesday, 18 July 2018 at 07:56 -0700, Lukasz Cwik wrote:
>
>
>
> On Wed, Jul 18, 2018 at 7:01 AM Etienne Chauchot 
> wrote:
>
> Hi,
> Luke, Alex, I have some portable metrics interrogations, can you confirm
> them ?
>
> 1 - As it is the SDK harness that will run the code of the UDFs, if a UDF
> defines a metric, then the SDK harness will give updates through GRPC calls
> to the runner so that the runner could update metrics cells, right?
>
>
> Yes.
>
>
>
> 2 - Alex, you mentioned in the proto and design doc that there will be no
> aggregation of metrics. But some runners (Spark/Flink) rely on
> accumulators, and when they are merged, it triggers the merging of the
> whole chain to the metric cells. I know that Dataflow does not do the
> same; it uses non-aggregated metrics and sends them to an aggregation
> service. Will there be a change of paradigm with portability for runners
> that merge themselves?
>
>
> There will be local aggregation of metrics scoped to a bundle; after the
> bundle is finished processing they are discarded. This will require some
> kind of global aggregation support from a runner, whether that runner does
> it via accumulators or via an aggregation service is up to the runner.
>
> 3 - Please confirm that the distinction between attempted and committed
> metrics is not the business of portable metrics. Indeed, it does not
> involve communication between the runner harness and the SDK harness as it
> is a runner only matter. I mean, when a runner commits a bundle it just
> updates its committed metrics and do not need to inform the SDK harness.
> But, of course, when the user requests committed metrics through the SDK,
> then the SDK harness will ask the runner harness to give them.
>
>
>
> You are correct in saying that during execution, the SDK does not
> differentiate between attempted and committed metrics and only the runner
> does. We still lack an API definition and contract for how an SDK would
> query for metrics from a runner, but you're right in saying that an SDK
> could request committed metrics and the runner would supply them somehow.
>
>
> Thanks
>
> Best
> Etienne
>
>
>
>


Re: [PROPOSAL] Test performance of basic Apache Beam operations

2018-09-10 Thread Etienne Chauchot
Hi,

It seems that there is a notable overlap with what Nexmark already does:
Nexmark measures performance and regressions by exercising the whole Beam
model in both batch and streaming modes with several runners. It also
computes on synthetic data. Also, Nexmark is already included as
PostCommits in the CI and dashboards.

Shall we merge the two?

Best
Etienne

On Monday, 10 September 2018 at 12:56 +0200, Łukasz Gajowy wrote:
> Hello everyone,
>
> thank you for all your comments to the proposal. To sum up:
>
> A set of performance tests exercising Core Beam Transforms (ParDo,
> GroupByKey, CoGroupByKey, Combine) will be implemented for Java and
> Python SDKs. Those tests will make it possible to:
>    - measure performance of the transforms on various runners
>    - exercise the transforms by creating stressful conditions and big
>      loads using Synthetic Source and Synthetic Step API (delays,
>      keeping the CPU busy or asleep, processing large keys and values,
>      performing fanout or reiteration of inputs)
>    - run both in batch and streaming context
>    - gather various metrics
>    - notice regressions by comparing data from consecutive Jenkins runs
>
> Metrics (runtime, consumed bytes, memory usage, split/bundle count) can
> be gathered during test invocations. We will start with runtime and
> leverage the Metrics API to collect the other metrics in later phases
> of development.
>
> The tests will be fully configurable through pipeline options and it
> will be possible to run any custom scenarios manually. However, a
> representative set of testing scenarios will be run periodically using
> Jenkins.
>
> Regards,
> Łukasz
>
> On Wed, 5 Sep 2018 at 20:31, Rafael Fernandez wrote:
> > neat! left a comment or two
> >
> > On Mon, Sep 3, 2018 at 3:53 AM Łukasz Gajowy wrote:
> > > Hi all!
> > >
> > > I'm bumping this (in case you missed it). Any feedback and
> > > questions are welcome!
> > >
> > > Best regards,
> > > Łukasz
> > >
> > > On Mon, 13 Aug 2018 at 13:51, Jean-Baptiste Onofré wrote:
> > > > Hi Lukasz,
> > > >
> > > > Thanks for the update, and the abstract looks promising.
> > > >
> > > > Let me take a look on the doc.
> > > >
> > > > Regards
> > > > JB
> > > >
> > > > On 13/08/2018 13:24, Łukasz Gajowy wrote:
> > > > > Hi all,
> > > > >
> > > > > since the Synthetic Sources API has been introduced in the Java
> > > > > and Python SDKs, it can be used to test some basic Apache Beam
> > > > > operations (i.e. GroupByKey, CoGroupByKey, Combine, ParDo and
> > > > > ParDo with SideInput) in terms of performance. This, in brief,
> > > > > is why we'd like to share the below proposal:
> > > > >
> > > > > https://docs.google.com/document/d/1PuIQv4v06eosKKwT76u7S6IP88AnXhTf870Rcj1AHt4/edit?usp=sharing
> > > > >
> > > > > Let us know what you think in the document's comments. Thank
> > > > > you in advance for all the feedback!
> > > > >
> > > > > Łukasz

Re: SplittableDoFn

2018-09-10 Thread Maximilian Michels

Thanks for moving forward with this, Lukasz!

Unfortunately, I can't make it on Friday, but I'll sync with somebody on
the call (e.g. Ryan) about your discussion.


On 08.09.18 02:00, Lukasz Cwik wrote:
Thanks to everyone who wanted to fill out the doodle poll. The most
popular time was Friday, Sept 14th from 11am-noon PST. I'll send out a
calendar invite and meeting link early next week.


I have received a lot of feedback on the document and have addressed 
some parts of it including:

* clarifying terminology
* processing skew due to some restrictions having their watermarks much
further behind than others, affecting scheduling of bundles by runners
* external throttling & I/O wait overhead reporting to make sure we 
don't overscale


Areas that still need additional feedback and details are:
* reporting progress around the work that is done and is active
* more examples
* unbounded restrictions being caused by an unbounded number of splits 
of existing unbounded restrictions (infinite work growth)
* whether we should be reporting this information at the PTransform 
level or at the bundle level




On Wed, Sep 5, 2018 at 1:53 PM Lukasz Cwik <lc...@google.com> wrote:

    Thanks to all those who have provided interest in this topic by the
    questions they have asked on the doc already and for those
    interested in having this discussion. I have set up this doodle to
    allow people to provide their availability:
    https://doodle.com/poll/nrw7w84255xnfwqy

    I'll send out the chosen time based upon people's availability and a
    Hangout link by end of day Friday, so please mark your availability
    using the link above.

    The agenda of the meeting will be as follows:
    * Overview of the proposal
    * Enumerate and discuss/answer questions brought up in the meeting

    Note that all questions and any discussions/answers provided will be
    added to the doc for those who are unable to attend.

    On Fri, Aug 31, 2018 at 9:47 AM Jean-Baptiste Onofré
    <j...@nanthrax.net> wrote:

        +1

        Regards
        JB

        On 31 Aug 2018, at 18:22, Lukasz Cwik <lc...@google.com> wrote:

            That is possible, I'll take people's date/time suggestions
            and create a simple online poll with them.

            On Fri, Aug 31, 2018 at 2:22 AM Robert Bradshaw
            <rober...@google.com> wrote:

                Thanks for taking this up. I added some comments to the
                doc. A European-friendly time for discussion would
                be great.

                On Fri, Aug 31, 2018 at 3:14 AM Lukasz Cwik
                <lc...@google.com> wrote:

                    I came up with a proposal[1] for a progress model
                    solely based off of the backlog and that splits
                    should be based upon the remaining backlog we want
                    the SDK to split at. I also give recommendations to
                    runner authors as to how an autoscaling system could
                    work based upon the measured backlog. A lot of
                    discussions around progress reporting and splitting
                    in the past have always been around finding an
                    optimal solution; after reading a lot of information
                    about work stealing, I don't believe there is a
                    general solution and it really is up to
                    SplittableDoFns to be well behaved. I did not do
                    much work in classifying what a well-behaved
                    SplittableDoFn is, though. Much of this work builds
                    off ideas that Eugene had documented in the past[2].

                    I could use the community's wide knowledge of
                    different I/Os to see if computing the backlog is
                    practical in the way that I'm suggesting and to
                    gather people's feedback.

                    If there is a lot of interest, I would like to hold
                    a community video conference between Sept 10th and
                    14th about this topic. Please reply with your
                    availability by Sept 6th if you're interested.

                    1: https://s.apache.org/beam-bundles-backlog-splitting
                    2: https://s.apache.org/beam-breaking-fusion

                    On Mon, Aug 13, 2018 at 10:21 AM Jean-Baptiste
                    Onofré <j...@nanthrax.net> wrote:

                        Awesome!

                        Thanks, Luke!

                        I plan to work with you and others on this one.

                        Regards
                        JB

                        On 13 Aug 2018, at 19:14, Lukasz Cwik
                        <lc...@google.com> wrote:

                            I wanted to reach out that I will be
                            continuing from where Eugene left off

Re: [Call for items] September Beam Newsletter

2018-09-10 Thread Maximilian Michels

Good stuff! Left some items for the Flink Runner.

On 08.09.18 02:14, Rose Nguyen wrote:

*bump*

Celebrate the weekend by sharing with the community your talks, 
contributions, plans, etc!


On Wed, Sep 5, 2018 at 10:25 AM Rose Nguyen wrote:

Hi Beamers:

Here's [1] the template for the September Beam Newsletter!

Add the highlights from August to now (or planned events and talks)
that you want to share with the community by 9/8 11:59 p.m. PDT.

We will collect the notes via Google docs but send out the final
version directly to the user mailing list. If you do not know how to
format something, it is OK to just put down the info and I will
edit. I'll ship out the newsletter on 9/10.

Looking forward to your interesting stories. :)

[1]

https://docs.google.com/document/d/1PE97Cf3yoNcx_A9zPzROT_kPtujRZLJoZyqIQfzqUvY/edit
-- 
Rose Thị Nguyễn




--
Rose Thị Nguyễn


Re: PTransforms and Fusion

2018-09-10 Thread Robert Bradshaw
A) I think it's a bug to not handle empty PTransforms (which are useful at
pipeline construction, and may still have meaning in terms of pipeline
structure, e.g. for visualization). Note that such transforms, if truly
composite, can't output any PCollections that do not appear in their inputs
(which is how we distinguish them from primitives in Python). Thus I'm in
favor of A3, and as a stopgap we can drop these transforms as part of/just
before decoding in the Java libraries (rather than in the SDKs during
encoding as in A2).

B) I'm also for B1 or B2.


On Mon, Sep 10, 2018 at 3:58 PM Maximilian Michels  wrote:

> > A) What should we do with these "empty" PTransforms?
>
> We can't translate them, so dropping them seems the most reasonable
> choice. Should we throw an error/warning to make the user aware of this?
> Otherwise might be unexpected for the user.
>
> >> A3) Handle the "empty" PTransform case within all of the shared
> libraries.
>
> What can we do at this point other than dropping them?
>
> > B) What should we do with "native" PTransforms?
>
> I support B1 and B2 as well. Non-portable PTransforms should be
> discouraged in the long run. However, the available PTransforms are not
> even consistent across the different SDKs yet (e.g. no streaming
> connectors in Python), so we should continue to provide a way to utilize
> the "native" transforms of a Runner.
>
> -Max

Re: [FYI] Paper of Building Beam Runner for IBM Streams

2018-09-10 Thread Maximilian Michels

Excellent write-up. Thank you!

On 09.09.18 20:43, Jean-Baptiste Onofré wrote:

Good idea. It could also help people who want to create runners.

Regards
JB

On 09/09/2018 13:00, Manu Zhang wrote:

Hi all,

I've spent the weekend reading Challenges and Experiences in Building an
Efficient Apache Beam Runner For IBM Streams, from the August
proceedings of PVLDB. It's quite enjoyable and urges me to reflect on
how I (should've) implemented the Gearpump runner. I believe it would be
beneficial to have more such papers and discussions sharing design
choices and lessons from various runners.

Enjoy it!
Manu Zhang




Re: PTransforms and Fusion

2018-09-10 Thread Maximilian Michels

> A) What should we do with these "empty" PTransforms?

We can't translate them, so dropping them seems the most reasonable
choice. Should we throw an error/warning to make the user aware of this?
Otherwise it might be unexpected for the user.

> A3) Handle the "empty" PTransform case within all of the shared libraries.

What can we do at this point other than dropping them?

> B) What should we do with "native" PTransforms?

I support B1 and B2 as well. Non-portable PTransforms should be
discouraged in the long run. However, the available PTransforms are not
even consistent across the different SDKs yet (e.g. no streaming
connectors in Python), so we should continue to provide a way to utilize
the "native" transforms of a Runner.


-Max

On 07.09.18 19:15, Lukasz Cwik wrote:
A primitive transform is a PTransform that has been chosen to have no
default implementation in terms of other PTransforms. A primitive
transform therefore must be implemented directly by a pipeline runner in
terms of pipeline-runner-specific concepts. An initial list of primitive
PTransforms was defined in [2] and has since been updated in [3].


As part of the portability effort, libraries that are intended to be 
shared across multiple runners are being developed to support their 
migration to a portable execution model. One of these is responsible for 
fusing multiple primitive PTransforms together into a pipeline runner 
specific concept. This library made the choice that a primitive 
PTransform is a PTransform that doesn't contain any other PTransforms.


Unfortunately, while Ryan was attempting to enable ValidatesRunner
tests for Flink using the new portability libraries, he ran into an
issue where the Apache Beam Java SDK allows a person to construct a
PTransform that has zero sub-PTransforms and also isn't one of the
defined Apache Beam primitives. In this case the PTransform was
trivial, as it applied no additional transforms to the input
PCollection and just returned it. This caused an issue within the
portability libraries, since they couldn't handle this structure.


To solve this issue, I proposed that we modify the portability
library that does fusion to use a whitelist of primitives, preventing
the issue from happening. This solved the problem but caused an issue
for Thomas, as he was relying on the behaviour of PTransforms with
zero sub-transforms being primitives. Thomas has a use case where he
wants to expose the internal Flink Kafka and Kinesis connectors and to
build Apache Beam pipelines that use the Flink native sources/sinks.
I'll call these "native" PTransforms, since they aren't part of the
Apache Beam model and are runner specific.


This brings up two topics:
A) What should we do with these "empty" PTransforms?
B) What should we do with "native" PTransforms?

The typical flow of a pipeline representation for a portable pipeline is:
language specific representation -> proto representation -> job service 
-> shared libraries that simplify/replace the proto representation with 
a simplified version (e.g. fusion) -> runner specific conversion to 
native runner concepts (e.g. GBK -> runner implementation of GBK)


--

A) What should we do with these "empty" PTransforms?

To give a little more detail, these transforms typically arise when
people have conditional logic, such as loops, that would perform an
expansion but do nothing if the condition is immediately unsatisfied.
So allowing for PTransforms that are empty is useful when building a
pipeline.
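
As a minimal illustration, consider a hypothetical composite (not the
one from Ryan's pipeline) whose expand() can end up applying nothing:

import apache_beam as beam

class MaybeFilter(beam.PTransform):
    """Composite that applies zero transforms when given no predicates."""
    def __init__(self, predicates):
        self._predicates = predicates  # may be an empty list

    def expand(self, pcoll):
        for predicate in self._predicates:
            pcoll = pcoll | beam.Filter(predicate)
        # With no predicates this returns the input unchanged, yielding
        # a composite with zero sub-PTransforms in the pipeline proto.
        return pcoll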


What should we do:
A1) Stick with the whitelist of primitive PTransforms.
A2) When converting the pipeline from language specific representation 
into the proto representation, drop any "empty" PTransforms. This means 
that the pipeline representation that is sent to the runner doesn't 
contain the offending type of PTransform and the shared libraries 
wouldn't have to change.

A3) Handle the "empty" PTransform case within all of the shared libraries.

I like doing both A1 and A2: A1 since it helps simplify the shared
libraries, because we know the whole list of primitives we need to
understand, and A2 because it removes noise about the pipeline shape
from its representation.


--

B) What should we do with "native" PTransforms?

Some approaches that we could take as a community:

B1) Prevent the usage of "native" PTransforms within Apache Beam, since
they hurt portability of pipelines across runners. This can be done by
using whitelists of allowed primitive PTransforms in the shared
libraries and explicitly not giving the shared libraries extension
points that would allow customizing this.


B2) We embrace that closed systems internal to companies will want to
use their own extensions, and enable support for "native" PTransforms,
but actively discourage "native" PTransforms in the open ecosystem.


B3) We embrace and allow for "native" PTransforms in the open ecosystem.


Re: PR/6343: Adding support for MustFollow

2018-09-10 Thread Maximilian Michels
This is a great idea, but I share Lukasz's doubts about this being a
universal solution for awaiting some action in a pipeline.

I wonder: wouldn't it work to not pass in a PCollection, but instead
wrap a DoFn which internally ensures the correct triggering behavior?
All runners which correctly materialize the side input with the first
window triggering should then support it correctly.


Apart from that, couldn't you simply use the @Setup method of a DoFn in 
your example?


-Max

On 07.09.18 23:12, Peter Li wrote:

Thanks!  I (PR author) agree with all that.

On the unbounded triggering issue, I can see two reasonable desired behaviors:
   1) The collection to follow is bounded, and the intent is to wait for
the entire collection to be processed.
   2) The collection to follow has windows that in some flexible sense
align with the windows of the collection that is supposed to be waiting,
and the intent is for the waiting to apply within each window.


If either or both of these behaviors can be supported by something like
the proposed mechanism, then I think that's a reasonable thing to have,
provided it's well documented.


The other high-level issue I see is whether this should be A) a free
PTransform on collections, or B) something like a call on ParDo. In the
PR it's (A), but I can imagine that for future, more native
implementations on some runners, it could be more natural as (B).


On Fri, Sep 7, 2018 at 11:58 AM Lukasz Cwik wrote:


A contributor opened a PR[1] to add support for a PTransform that
forces one PTransform to be executed before another by using side
input readiness as a way to defer execution.

They have provided this example usage:

# Ensure that the output dir is created before attempting to write
# output files.
output_dir_ready = pipeline | beam.Create([output_path]) | CreateOutputDir()
output_pcoll | MustFollow(output_dir_ready) | WriteOutput()

The example the PR author provided works since they are working in
the global window with no triggers that have early firings, and they
are only processing a single element, so `CreateOutputDir()` is
invoked only once. The unused side input access forces all runners
to wait until `CreateOutputDir()` has run before `WriteOutput()`.
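
For context, here is a minimal sketch of how such a transform can
defer execution via an unused side input (hypothetical code, not the
PR's actual implementation):

import apache_beam as beam

class MustFollow(beam.PTransform):
    """Delays consumers of pcoll until `marker` has produced output."""
    def __init__(self, marker):
        self._marker = marker

    def expand(self, pcoll):
        # The side input is never read; its readiness alone gates
        # when downstream transforms may execute.
        return pcoll | beam.Map(
            lambda element, unused_side: element,
            unused_side=beam.pvalue.AsIter(self._marker))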

Now imagine a user wanted to use `MustFollow` but incorrectly set up
the trigger for the side input so that it has an early firing (such as
an after-count-1 trigger), and has multiple directories they want to
create. The side input would become ready before *all* of
`CreateOutputDir()` was invoked. This would mean that `MustFollow`
would be able to access the side input and hence allow
`WriteOutput()` to happen. The example above is still a bounded
pipeline, and a runner could choose to execute it as I described.

The contract for side inputs is that the side input PCollection must
have at least one firing, based upon the upstream windowing strategy,
for the "requested" window, or the runner must be sure that no such
window could ever be produced, before it can be accessed by a
consumer.

My concern with the PR is twofold:
1) I'm not sure whether this works for all runners in both bounded and
unbounded pipelines, and I could use insights from others.

I believe that if the user is using a trigger which only fires once,
and they guarantee that the main input window to side input window
mapping is 1:1, then they are likely to get the expected behavior:
WriteOutput() will always happen once CreateOutputDir() for a given
window has finished.

2) If the solution works, the documentation around the limitations
of how the PTransform works needs a lot more detail, and potentially
"error" checking of the windowing strategy.

1: https://github.com/apache/beam/pull/6343



Re: [PROPOSAL] Beam jira bot exceptions

2018-09-10 Thread Jean-Baptiste Onofré
+1

It would be great to be able to configure "known issues" ;)

Regards
JB

On 10/09/2018 15:07, Etienne Chauchot wrote:
> Hi,
> 
> The bot for dependency checks opens tickets automatically when it sees
> stale dependencies. But in some cases this is perfectly normal.
> For example, in ElasticsearchIO we have the core module (used by the
> users) that is compatible with all the versions (for backward
> compatibility and ease of use). But there is also one test module per
> supported version (v2, v5, v6) that runs tests against an embedded
> version of ES. These modules contain ES deps in v2, v5 and v6.
> In such a case the bot should not open a ticket. Shall we configure it
> with exceptions?
> 
> Note that for now I closed the related tickets.
> 
> Best
> Etienne

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


unsubscribe

2018-09-10 Thread Alexander Kolchin
On Mon, Sep 10, 2018 at 3:54 PM Apache Jenkins Server <
jenk...@builds.apache.org> wrote:

> See <
> https://builds.apache.org/job/beam_PerformanceTests_MongoDBIO_IT/659/display/redirect?page=changes
> >
>
> Changes:
>
> [robertwb] Update container versions of NumPy and TensorFlow.
>
> --
> Started by timer
> [EnvInject] - Loading node environment variables.
> Building remotely on beam2 (beam) in workspace <
> https://builds.apache.org/job/beam_PerformanceTests_MongoDBIO_IT/ws/>
>  > git rev-parse --is-inside-work-tree # timeout=10
> Fetching changes from the remote Git repository
>  > git config remote.origin.url https://github.com/apache/beam.git #
> timeout=10
> Fetching upstream changes from https://github.com/apache/beam.git
>  > git --version # timeout=10
>  > git fetch --tags --progress https://github.com/apache/beam.git
> +refs/heads/*:refs/remotes/origin/*
> +refs/pull/${ghprbPullId}/*:refs/remotes/origin/pr/${ghprbPullId}/*
>  > git rev-parse origin/master^{commit} # timeout=10
> Checking out Revision 9b68f926628d727e917b6a33ccdafcfe693eef6a
> (origin/master)
>  > git config core.sparsecheckout # timeout=10
>  > git checkout -f 9b68f926628d727e917b6a33ccdafcfe693eef6a
> Commit message: "Merge pull request #6314 from robertwb/update-tf"
>  > git rev-list --no-walk 6aa2cf55376481de83a801cd9d79fb775916ee88 #
> timeout=10
> Cleaning workspace
>  > git rev-parse --verify HEAD # timeout=10
> Resetting working tree
>  > git reset --hard # timeout=10
>  > git clean -fdx # timeout=10
> [EnvInject] - Executing scripts and injecting environment variables after
> the SCM step.
> [EnvInject] - Injecting as environment variables the properties content
> SPARK_LOCAL_IP=127.0.0.1
>
> [EnvInject] - Variables injected successfully.
> [beam_PerformanceTests_MongoDBIO_IT] $ /bin/bash -xe
> /tmp/jenkins2993188719564576829.sh
> + gcloud container clusters get-credentials io-datastores
> --zone=us-central1-a --verbosity=debug
> DEBUG: Running [gcloud.container.clusters.get-credentials] with arguments:
> [--verbosity: "debug", --zone: "us-central1-a", NAME: "io-datastores"]
> Fetching cluster endpoint and auth data.
> DEBUG: Saved kubeconfig to /home/jenkins/.kube/config
> kubeconfig entry generated for io-datastores.
> INFO: Display format "default".
> DEBUG: SDK update checks are disabled.
> [beam_PerformanceTests_MongoDBIO_IT] $ /bin/bash -xe
> /tmp/jenkins8940561565152653460.sh
> + cp /home/jenkins/.kube/config <
> https://builds.apache.org/job/beam_PerformanceTests_MongoDBIO_IT/ws/config-beam-performancetests-mongodbio-it-659
> >
> [beam_PerformanceTests_MongoDBIO_IT] $ /bin/bash -xe
> /tmp/jenkins5506506063862735166.sh
> + kubectl --kubeconfig=<
> https://builds.apache.org/job/beam_PerformanceTests_MongoDBIO_IT/ws/config-beam-performancetests-mongodbio-it-659>
> create namespace beam-performancetests-mongodbio-it-659
> namespace "beam-performancetests-mongodbio-it-659" created
> [beam_PerformanceTests_MongoDBIO_IT] $ /bin/bash -xe
> /tmp/jenkins6366458138821826994.sh
> ++ kubectl config current-context
> + kubectl --kubeconfig=<
> https://builds.apache.org/job/beam_PerformanceTests_MongoDBIO_IT/ws/config-beam-performancetests-mongodbio-it-659>
> config set-context gke_apache-beam-testing_us-central1-a_io-datastores
> --namespace=beam-performancetests-mongodbio-it-659
> Context "gke_apache-beam-testing_us-central1-a_io-datastores" modified.
> [beam_PerformanceTests_MongoDBIO_IT] $ /bin/bash -xe
> /tmp/jenkins4547051839724193689.sh
> + rm -rf <
> https://builds.apache.org/job/beam_PerformanceTests_MongoDBIO_IT/ws/PerfKitBenchmarker
> >
> [beam_PerformanceTests_MongoDBIO_IT] $ /bin/bash -xe
> /tmp/jenkins5689562862061444267.sh
> + rm -rf <
> https://builds.apache.org/job/beam_PerformanceTests_MongoDBIO_IT/ws/env/.beam_env
> >
> [beam_PerformanceTests_MongoDBIO_IT] $ /bin/bash -xe
> /tmp/jenkins4465905129866843493.sh
> + rm -rf <
> https://builds.apache.org/job/beam_PerformanceTests_MongoDBIO_IT/ws/env/.perfkit_env
> >
> [beam_PerformanceTests_MongoDBIO_IT] $ /bin/bash -xe
> /tmp/jenkins4609654158763053560.sh
> + virtualenv <
> https://builds.apache.org/job/beam_PerformanceTests_MongoDBIO_IT/ws/env/.beam_env>
> --system-site-packages
> New python executable in <
> https://builds.apache.org/job/beam_PerformanceTests_MongoDBIO_IT/ws/env/.beam_env/bin/python2
> >
> Also creating executable in <
> https://builds.apache.org/job/beam_PerformanceTests_MongoDBIO_IT/ws/env/.beam_env/bin/python
> >
> Installing setuptools, pkg_resources, pip, wheel...done.
> Running virtualenv with interpreter /usr/bin/python2
> [beam_PerformanceTests_MongoDBIO_IT] $ /bin/bash -xe
> /tmp/jenkins4104139872799500028.sh
> + virtualenv <
> https://builds.apache.org/job/beam_PerformanceTests_MongoDBIO_IT/ws/env/.perfkit_env>
> --system-site-packages
> New python executable in <
> https://builds.apache.org/job/beam_PerformanceTests_MongoDBIO_IT/ws/env/.perfkit_env/bin/python2
> >
> Also creating executable in 

[PROPOSAL] Beam jira bot exceptions

2018-09-10 Thread Etienne Chauchot
Hi,

The bot for dependency checks opens tickets automatically when it sees
stale dependencies. But in some cases this is perfectly normal.
For example, in ElasticsearchIO we have the core module (used by the
users) that is compatible with all the versions (for backward
compatibility and ease of use). But there is also one test module per
supported version (v2, v5, v6) that runs tests against an embedded
version of ES. These modules contain ES deps in v2, v5 and v6.
In such a case the bot should not open a ticket. Shall we configure it
with exceptions?
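
Something along these lines could work, e.g. a purely hypothetical
exceptions list the bot consults before filing (not an existing bot
feature):

# Hypothetical allowlist: artifacts whose pinned version lines the
# dependency bot should not report as stale.
DEPENDENCY_CHECK_EXCEPTIONS = {
    'org.elasticsearch:elasticsearch': ('2.', '5.', '6.'),
    'org.elasticsearch.client:transport': ('2.', '5.', '6.'),
}

def should_file_ticket(artifact, current_version):
    pinned = DEPENDENCY_CHECK_EXCEPTIONS.get(artifact, ())
    # Skip filing when the module pins this version line on purpose.
    return not current_version.startswith(pinned)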

Note that for now I closed the related tickets.

Best
Etienne

Re: [jira] [Created] (BEAM-5237) Beam Dependency Update Request: org.elasticsearch.client:transport 6.4.0

2018-09-10 Thread Alexander Kolchin
unsubscribe

On Mon, Aug 27, 2018 at 3:18 PM Beam JIRA Bot (JIRA) wrote:

> Beam JIRA Bot created BEAM-5237:
> ---
>
>  Summary: Beam Dependency Update Request:
> org.elasticsearch.client:transport 6.4.0
>  Key: BEAM-5237
>  URL: https://issues.apache.org/jira/browse/BEAM-5237
>  Project: Beam
>   Issue Type: Sub-task
>   Components: dependencies
> Reporter: Beam JIRA Bot
> Assignee: Etienne Chauchot
>
>
>
>
> 2018-08-27 12:17:33.728604
>
> Please review and upgrade the org.elasticsearch.client:transport
> to the latest version 6.4.0
>
> cc: [~timrobertson100],
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v7.6.3#76005)
>


-- 
Alexander Kolchin
akolc...@gmail.com


Beam Dependency Check Report (2018-09-10)

2018-09-10 Thread Apache Jenkins Server

High Priority Dependency Updates Of Beam Python SDK:

  Dependency Name | Current Version | Latest Version | Release Date Of The Current Used Version | Release Date Of The Latest Release | JIRA Issue
  google-cloud-bigquery | 0.25.0 | 1.5.0 | 2017-06-26 | 2017-06-26 | BEAM-5079
  google-cloud-pubsub | 0.26.0 | 0.37.2 | 2017-06-26 | 2017-06-26 | BEAM-5223

High Priority Dependency Updates Of Beam Java SDK:

  Dependency Name | Current Version | Latest Version | Release Date Of The Current Used Version | Release Date Of The Latest Release | JIRA Issue
  org.assertj:assertj-core | 2.5.0 | 3.11.1 | 2016-07-03 | 2018-08-28 | BEAM-5289
  com.google.auto.service:auto-service | 1.0-rc2 | 1.0-rc4 | 2014-10-25 | 2017-12-11 | BEAM-4874
  biz.aQute:bndlib | 1.43.0 | 2.0.0.20130123-133441 | 2011-04-01 | 2013-02-27 | BEAM-4884
  com.gradle:build-scan-plugin | 1.13.1 | 1.16 | 2018-04-10 | 2018-08-27 | BEAM-5224
  org.apache.cassandra:cassandra-all | 3.9 | 3.11.3 | 2016-09-26 | 2018-07-25 | BEAM-5083
  commons-cli:commons-cli | 1.2 | 1.4 | 2009-03-19 | 2017-03-09 | BEAM-4896
  commons-codec:commons-codec | 1.9 | 1.11 | 2013-12-21 | 2017-10-17 | BEAM-4898
  org.apache.commons:commons-dbcp2 | 2.1.1 | 2.5.0 | 2015-08-03 | 2018-07-13 | BEAM-4900
  com.typesafe:config | 1.3.0 | 1.3.3 | 2015-05-08 | 2018-02-21 | BEAM-4902
  de.flapdoodle.embed:de.flapdoodle.embed.mongo | 1.50.1 | 2.1.1 | 2015-12-11 | 2018-06-21 | BEAM-4904
  de.flapdoodle.embed:de.flapdoodle.embed.process | 1.50.1 | 2.0.5 | 2015-12-10 | 2018-06-21 | BEAM-4905
  org.elasticsearch:elasticsearch | 5.6.3 | 6.4.0 | 2017-10-06 | 2018-08-18 | BEAM-5225
  org.elasticsearch:elasticsearch-hadoop | 5.0.0 | 6.4.0 | 2016-10-26 | 2018-08-18 | BEAM-5226
  org.elasticsearch.client:elasticsearch-rest-client | 5.6.3 | 6.4.0 | 2017-10-06 | 2018-08-18 | BEAM-5227
  org.elasticsearch.test:framework | 5.6.3 | 6.4.0 | 2017-10-06 | 2018-08-18 | BEAM-5228
  net.ltgt.gradle:gradle-apt-plugin | 0.13 | 0.18 | 2017-11-01 | 2018-07-23 | BEAM-4924
  com.commercehub.gradle.plugin:gradle-avro-plugin | 0.11.0 | 0.15.0 | 2018-01-30 | 2018-08-27 | BEAM-5229
  com.google.code.gson:gson | 2.7 | 2.8.5 | 2016-06-14 | 2018-05-22 | BEAM-4947
  com.google.guava:guava | 20.0 | 26.0-jre | 2016-10-28 | 2018-08-01 | BEAM-5085
  org.apache.hbase:hbase-common | 1.2.6 | 2.1.0 | 2017-05-29 | 2018-07-10 | BEAM-4951
  org.apache.hbase:hbase-hadoop-compat | 1.2.6 | 2.1.0 | 2017-05-29 | 2018-07-10 | BEAM-4952
  org.apache.hbase:hbase-hadoop2-compat | 1.2.6 | 2.1.0 | 2017-05-29 | 2018-07-10 | BEAM-4953
  org.apache.hbase:hbase-server | 1.2.6 | 2.1.0 | 2017-05-29 | 2018-07-10 | BEAM-4954
  org.apache.hbase:hbase-shaded-client | 1.2.6 | 2.1.0 | 2017-05-29 | 2018-07-10 | BEAM-4955
  org.apache.hbase:hbase-shaded-server | 1.2.6 | 2.0.0-alpha2 | 2017-05-29 | 2017-08-16 | BEAM-4956
  org.apache.hive:hive-cli | 2.1.0 | 3.1.0.3.0.1.0-136 | 2016-06-16 | 2018-09-10 | BEAM-5344
  org.apache.hive:hive-common | 2.1.0 | 3.1.0.3.0.1.0-136 | 2016-06-16 | 2018-09-10 | BEAM-5345
  org.apache.hive:hive-exec | 2.1.0 | 3.1.0.3.0.1.0-136 | 2016-06-16 | 2018-09-10 | BEAM-5346
  org.apache.hive.hcatalog:hive-hcatalog-core | 2.1.0 | 3.1.0.3.0.1.0-136 | 2016-06-16 | 2018-09-10 | BEAM-5347
  net.java.dev.javacc:javacc | 4.0 | 7.0.3 | 2006-03-17 | 2017-11-06 | BEAM-4969
  org.slf4j:jcl-over-slf4j | 1.7.25 | 1.8.0-beta2 | 2017-03-16 | 2018-03-21 | BEAM-5234
  net.java.dev.jna:jna | 4.1.0 | 4.5.2 | 2014-03-06 | 2018-07-12 | BEAM-4973
  com.esotericsoftware.kryo:kryo | 2.21 | 2.24.0 | 2013-02-27 | 2014-05-04 | BEAM-4975
  org.apache.kudu:kudu-client | 1.4.0 | 1.7.1 | 2017-06-05 | 2018-05-30 | BEAM-5087
  io.dropwizard.metrics:metrics-core | 3.1.2 | 4.1.0-rc2 | 2015-04-26 | 2018-05-03 | BEAM-4977
  org.mongodb:mongo-java-driver | 3.2.2 | 3.8.1 | 2016-02-15 | 2018-08-20 | BEAM-5235
  io.netty:netty-all | 4.1.17.Final | 5.0.0.Alpha2 | 2017-11-08 |

Re: [VOTE] Release 2.7.0, release candidate #1

2018-09-10 Thread Jean-Baptiste Onofré
+1 (binding)

Tested successfully on Beam Samples.

Thanks !

Regards
JB

On 07/09/2018 23:56, Charles Chen wrote:
> Hi everyone,
> 
> Please review and vote on the release candidate #1 for the version
> 2.7.0, as follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
> 
> The complete staging area is available for your review, which includes:
> * JIRA release notes [1],
> * the official Apache source release to be deployed to dist.apache.org
> [2], which is signed with the key with fingerprint 45C60AAAD115F560 [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "v2.7.0-RC1" [5],
> * website pull request listing the release and publishing the API
> reference manual [6].
> * Java artifacts were built with Gradle 4.8 and OpenJDK
> 1.8.0_181-8u181-b13-1~deb9u1-b13.
> * Python artifacts are deployed along with the source release to
> dist.apache.org [2].
> 
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
> 
> Thanks,
> Charles
> 
> [1]
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12343654
> [2] https://dist.apache.org/repos/dist/dev/beam/2.7.0
> [3] https://dist.apache.org/repos/dist/dev/beam/KEYS
> [4] https://repository.apache.org/content/repositories/orgapachebeam-1046/
> [5] https://github.com/apache/beam/tree/v2.7.0-RC1
> [6] https://github.com/apache/beam-site/pull/549

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [PROPOSAL] Test performance of basic Apache Beam operations

2018-09-10 Thread Łukasz Gajowy
Hello everyone,

thank you for all your comments to the proposal. To sum up:

A set of performance tests exercising Core Beam Transforms (ParDo,
GroupByKey, CoGroupByKey, Combine) will be implemented for the Java and
Python SDKs. Those tests will allow us to:

   - measure the performance of the transforms on various runners
   - exercise the transforms by creating stressful conditions and big loads
   using the Synthetic Source and Synthetic Step APIs (delays, keeping the
   CPU busy or asleep, processing large keys and values, performing fanout
   or reiteration of inputs)
   - run in both batch and streaming contexts
   - gather various metrics
   - notice regressions by comparing data from consecutive Jenkins runs

Metrics (runtime, consumed bytes, memory usage, split/bundle count) can be
gathered during test invocations. We will start with runtime and leverage
the Metrics API to collect the other metrics in later phases of development.
The tests will be fully configurable through pipeline options, and it will
be possible to run any custom scenario manually. However, a representative
set of testing scenarios will be run periodically using Jenkins.
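
For illustration, here is a minimal Python sketch of a GroupByKey load
test in the spirit of the proposal (using a simple stand-in for the
Synthetic Source; the real configuration is described in the design doc):

import time
import apache_beam as beam

def run_gbk_load_test(num_keys=1000, values_per_key=100, value_size=1024):
    """Runs a simple GroupByKey load pipeline and reports its runtime."""
    start = time.time()
    with beam.Pipeline() as pipeline:
        _ = (pipeline
             | beam.Create(range(num_keys * values_per_key))
             # Stand-in for a synthetic source: keyed, fixed-size values.
             | beam.Map(lambda i: (i % num_keys, b'x' * value_size))
             | beam.GroupByKey()
             | beam.Map(lambda kv: (kv[0], sum(1 for _ in kv[1]))))
    print('runtime: %.2fs' % (time.time() - start))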

Regards,
Łukasz

On Wed, Sep 5, 2018 at 20:31, Rafael Fernandez wrote:

> neat! left a comment or two
>
> On Mon, Sep 3, 2018 at 3:53 AM Łukasz Gajowy wrote:
>
>> Hi all!
>>
>> I'm bumping this (in case you missed it). Any feedback and questions are
>> welcome!
>>
>> Best regards,
>> Łukasz
>>
>> On Mon, Aug 13, 2018 at 13:51, Jean-Baptiste Onofré wrote:
>>
>>> Hi Lukasz,
>>>
>>> Thanks for the update, and the abstract looks promising.
>>>
>>> Let me take a look at the doc.
>>>
>>> Regards
>>> JB
>>>
>>> On 13/08/2018 13:24, Łukasz Gajowy wrote:
>>> > Hi all,
>>> >
>>> > since the Synthetic Sources API has been introduced in the Java and
>>> > Python SDKs, it can be used to test some basic Apache Beam operations
>>> > (i.e. GroupByKey, CoGroupByKey, Combine, ParDo and ParDo with SideInput) in
>>> > terms of performance. This, in brief, is why we'd like to share the
>>> > below proposal:
>>> >
>>> > _
>>> https://docs.google.com/document/d/1PuIQv4v06eosKKwT76u7S6IP88AnXhTf870Rcj1AHt4/edit?usp=sharing_
>>> >
>>> > Let us know what you think in the document's comments. Thank you in
>>> > advance for all the feedback!
>>> >
>>> > Łukasz
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>


Re: [portability] metrics interrogations

2018-09-10 Thread Etienne Chauchot
Hi all,

@Luke, @Alex, I have a general question related to metrics in the Fn API.
Since communication between the runner harness and the SDK harness is done
on a bundle basis, when the runner harness sends data to the SDK harness to
execute a transform that contains metrics, does it:
   1. send metrics values (for the ones defined in the transform) along
   with the data, and receive an updated value of the metrics from the SDK
   harness when the bundle is finished processing?
   2. or send only the data, with the SDK harness responding with a diff
   value of the metrics so that the runner can update them on its side?
My bet is option 2, but can you confirm?
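
To illustrate what I mean by option 2 (purely schematic, not the actual
Fn API shape):

# The runner sends only data for the bundle; the SDK harness replies
# with per-metric deltas that the runner folds into its own cells.
bundle_response = {'metric_deltas': {'my_counter': 42}}

runner_metrics = {'my_counter': 100}
for name, delta in bundle_response['metric_deltas'].items():
    runner_metrics[name] = runner_metrics.get(name, 0) + delta
# runner_metrics['my_counter'] is now 142
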
Thanks
Etienne
On Thursday, 19 July 2018 at 15:10 +0200, Etienne Chauchot wrote:
> Thanks for the confirmations Luke.
> On Wednesday, 18 July 2018 at 07:56 -0700, Lukasz Cwik wrote:
> > On Wed, Jul 18, 2018 at 7:01 AM Etienne Chauchot wrote:
> > > Hi,
> > > Luke, Alex, I have some questions about portable metrics; can you
> > > confirm the following?
> > > 
> > > 1 - As it is the SDK harness that will run the code of the UDFs, if a UDF
> > > defines a metric, then the SDK harness will give updates through gRPC
> > > calls to the runner so that the runner can update its metric cells, right?
> > 
> > Yes. 
> > > 2 - Alex, you mentioned in the proto and design doc that there will be no
> > > aggregation of metrics. But some runners (Spark/Flink) rely on
> > > accumulators, and when they are merged, it triggers the merging of the
> > > whole chain down to the metric cells. I know that Dataflow does not do
> > > the same; it uses non-aggregated metrics and sends them to an aggregation
> > > service. Will there be a change of paradigm with portability for runners
> > > that merge themselves?
> > 
> > There will be local aggregation of metrics scoped to a bundle; after the
> > bundle is finished processing they are discarded. This will require some
> > kind of global aggregation support from a runner; whether that runner
> > does it via accumulators or via an aggregation service is up to the runner.
> > > 3 - Please confirm that the distinction between attempted and committed
> > > metrics is not the business of portable metrics. Indeed, it does not
> > > involve communication between the runner harness and the SDK harness, as
> > > it is a runner-only matter. I mean, when a runner commits a bundle it
> > > just updates its committed metrics and does not need to inform the SDK
> > > harness. But, of course, when the user requests committed metrics through
> > > the SDK, then the SDK harness will ask the runner harness to give them.
> > > 
> > > 
> > You are correct in saying that during execution the SDK does not
> > differentiate between attempted and committed metrics; only the runner
> > does. We still lack an API definition and contract for how an SDK would
> > query for metrics from a runner, but you're right in saying that an SDK
> > could request committed metrics and the Runner would supply them somehow.
> > > Thanks
> > > Best
> > > Etienne
> > > 
> > > 

Build failed in Jenkins: beam_Release_Gradle_NightlySnapshot #167

2018-09-10 Thread Apache Jenkins Server
See 


--
[...truncated 18.59 MB...]
Skipping task ':beam-vendor-sdks-java-extensions-protobuf:test' as it has no 
source files and no previous output files.
:beam-vendor-sdks-java-extensions-protobuf:test (Thread[Task worker for 
':',5,main]) completed. Took 0.0 secs.
:beam-vendor-sdks-java-extensions-protobuf:validateShadedJarDoesntLeakNonOrgApacheBeamClasses
 (Thread[Task worker for ':',5,main]) started.

> Task 
> :beam-vendor-sdks-java-extensions-protobuf:validateShadedJarDoesntLeakNonOrgApacheBeamClasses
Caching disabled for task 
':beam-vendor-sdks-java-extensions-protobuf:validateShadedJarDoesntLeakNonOrgApacheBeamClasses':
 Caching has not been enabled for the task
Task 
':beam-vendor-sdks-java-extensions-protobuf:validateShadedJarDoesntLeakNonOrgApacheBeamClasses'
 is not up-to-date because:
  Task has not declared any outputs despite executing actions.
:beam-vendor-sdks-java-extensions-protobuf:validateShadedJarDoesntLeakNonOrgApacheBeamClasses
 (Thread[Task worker for ':',5,main]) completed. Took 0.032 secs.
:beam-vendor-sdks-java-extensions-protobuf:check (Thread[Task worker for 
':',5,main]) started.

> Task :beam-vendor-sdks-java-extensions-protobuf:check
Skipping task ':beam-vendor-sdks-java-extensions-protobuf:check' as it has no 
actions.
:beam-vendor-sdks-java-extensions-protobuf:check (Thread[Task worker for 
':',5,main]) completed. Took 0.0 secs.
:beam-vendor-sdks-java-extensions-protobuf:build (Thread[Task worker for 
':',5,main]) started.

> Task :beam-vendor-sdks-java-extensions-protobuf:build
Skipping task ':beam-vendor-sdks-java-extensions-protobuf:build' as it has no 
actions.
:beam-vendor-sdks-java-extensions-protobuf:build (Thread[Task worker for 
':',5,main]) completed. Took 0.0 secs.
:beam-website:assemble (Thread[Task worker for ':',5,main]) started.

> Task :beam-website:assemble UP-TO-DATE
Skipping task ':beam-website:assemble' as it has no actions.
:beam-website:assemble (Thread[Task worker for ':',5,main]) completed. Took 0.0 
secs.
:beam-website:setupBuildDir (Thread[Task worker for ':',5,main]) started.

> Task :beam-website:setupBuildDir
Build cache key for task ':beam-website:setupBuildDir' is 
7afd3bd0c1a9269d50131165bb8a63a1
Caching disabled for task ':beam-website:setupBuildDir': Caching has not been 
enabled for the task
Task ':beam-website:setupBuildDir' is not up-to-date because:
  No history is available.
:beam-website:setupBuildDir (Thread[Task worker for ':',5,main]) completed. 
Took 0.003 secs.
:beam-website:buildDockerImage (Thread[Task worker for ':',5,main]) started.

> Task :beam-website:buildDockerImage
Caching disabled for task ':beam-website:buildDockerImage': Caching has not 
been enabled for the task
Task ':beam-website:buildDockerImage' is not up-to-date because:
  Task has not declared any outputs despite executing actions.
Starting process 'command 'docker''. Working directory: 

 Command: docker build -t beam-website .
Successfully started process 'command 'docker''
Sending build context to Docker daemon  24.51MB
Step 1/7 : FROM ruby:2.5
 ---> 88666731c3e1
Step 2/7 : WORKDIR /ruby
 ---> Using cache
 ---> 9b7353f27cb5
Step 3/7 : RUN gem install bundler
 ---> Using cache
 ---> cd46d9b7ccbe
Step 4/7 : ADD Gemfile Gemfile.lock /ruby/
 ---> Using cache
 ---> d50c22e097f2
Step 5/7 : RUN bundle install --deployment --path $GEM_HOME
 ---> Using cache
 ---> e8881d09b465
Step 6/7 : ENV LC_ALL C.UTF-8
 ---> Using cache
 ---> 3787b82c937d
Step 7/7 : CMD sleep 3600
 ---> Using cache
 ---> 1608418b66da
Successfully built 1608418b66da
Successfully tagged beam-website:latest
:beam-website:buildDockerImage (Thread[Task worker for ':',5,main]) completed. 
Took 0.22 secs.
:beam-website:createDockerContainer (Thread[Task worker for ':',5,main]) 
started.

> Task :beam-website:createDockerContainer
Caching disabled for task ':beam-website:createDockerContainer': Caching has 
not been enabled for the task
Task ':beam-website:createDockerContainer' is not up-to-date because:
  Task has not declared any outputs despite executing actions.
Starting process 'command '/bin/bash''. Working directory: 

 Command: /bin/bash -c docker create -v 
:/repo
 -u $(id -u):$(id -g) beam-website
Successfully started process 'command '/bin/bash''
:beam-website:createDockerContainer (Thread[Task worker for ':',5,main]) 
completed. Took 2.381 secs.
:beam-website:startDockerContainer (Thread[Task worker for ':',5,main]) started.

> Task :beam-website:startDockerContainer
Caching disabled for task ':beam-website:startDockerContainer': Caching has not 
been enabled for the task
Task ':beam-website:startDockerContainer' is not