Re: [VOTE] Donating the Dataflow Worker code to Apache Beam

2018-09-14 Thread Tim
+1 > On 15 Sep 2018, at 01:23, Yifan Zou wrote: > > +1 > >> On Fri, Sep 14, 2018 at 4:20 PM David Morávek >> wrote: >> +1 >> >> >> >>> On 15 Sep 2018, at 00:59, Anton Kedin wrote: >>> >>> +1 >>> On Fri, Sep 14, 2018 at 3:22 PM Alan Myrvold wrote: +1 > On Fri, Sep 14

Re: Run Python3 tests in miniconda

2018-09-14 Thread Manu Zhang
Hi Valentyn, 1. No, I haven't. I can manually run tests with `python setup.py nosetests --test *` (BTW, the doc does not mention that `pip install nose` is required). 2. Yes, I'm using Miniconda3 on Mac and I've done `source ~/miniconda3/bin/activate py3` to activate my Python3 environment. It see

Re: SplittableDoFn

2018-09-14 Thread Lukasz Cwik
Thanks to everyone who joined and for the questions asked. Ryan graciously collected notes of the discussion: https://docs.google.com/document/d/1kjJLGIiNAGvDiUCMEtQbw8tyOXESvwGeGZLL-0M06fQ/edit?usp=sharing The summary was that bringing BoundedSource/UnboundedSource into using a unified backlog-r

Re: Run Python3 tests in miniconda

2018-09-14 Thread Valentyn Tymofieiev
Hi Manu, I saw your PR https://github.com/apache/beam/pull/6397 (thanks a lot!) - did you resolve the issue with setup? I have not tried Miniconda with Beam myself. Perhaps you could describe your setup in more detail, so that I (or other folks on the list) could try to reproduce the issue? Also,

Re: [VOTE] Donating the Dataflow Worker code to Apache Beam

2018-09-14 Thread Yifan Zou
+1 On Fri, Sep 14, 2018 at 4:20 PM David Morávek wrote: > +1 > > > > On 15 Sep 2018, at 00:59, Anton Kedin wrote: > > +1 > > On Fri, Sep 14, 2018 at 3:22 PM Alan Myrvold wrote: > >> +1 >> >> On Fri, Sep 14, 2018 at 3:16 PM Boyuan Zhang wrote: >> >>> +1 >>> >>> On Fri, Sep 14, 2018 at 3:15 PM

Re: [VOTE] Donating the Dataflow Worker code to Apache Beam

2018-09-14 Thread David Morávek
+1 > On 15 Sep 2018, at 00:59, Anton Kedin wrote: > > +1 > >> On Fri, Sep 14, 2018 at 3:22 PM Alan Myrvold wrote: >> +1 >> >>> On Fri, Sep 14, 2018 at 3:16 PM Boyuan Zhang wrote: >>> +1 >>> On Fri, Sep 14, 2018 at 3:15 PM Henning Rohde wrote: +1 > On Fri, Sep 14, 2018

Re: [VOTE] Donating the Dataflow Worker code to Apache Beam

2018-09-14 Thread Anton Kedin
+1 On Fri, Sep 14, 2018 at 3:22 PM Alan Myrvold wrote: > +1 > > On Fri, Sep 14, 2018 at 3:16 PM Boyuan Zhang wrote: > >> +1 >> >> On Fri, Sep 14, 2018 at 3:15 PM Henning Rohde wrote: >> >>> +1 >>> >>> On Fri, Sep 14, 2018 at 2:40 PM Ahmet Altay wrote: >>> +1 (binding) On Fri, S

Re: [VOTE] Donating the Dataflow Worker code to Apache Beam

2018-09-14 Thread Alan Myrvold
+1 On Fri, Sep 14, 2018 at 3:16 PM Boyuan Zhang wrote: > +1 > > On Fri, Sep 14, 2018 at 3:15 PM Henning Rohde wrote: > >> +1 >> >> On Fri, Sep 14, 2018 at 2:40 PM Ahmet Altay wrote: >> >>> +1 (binding) >>> >>> On Fri, Sep 14, 2018 at 2:35 PM, Lukasz Cwik wrote: >>> +1 (binding)

Re: [VOTE] Donating the Dataflow Worker code to Apache Beam

2018-09-14 Thread Boyuan Zhang
+1 On Fri, Sep 14, 2018 at 3:15 PM Henning Rohde wrote: > +1 > > On Fri, Sep 14, 2018 at 2:40 PM Ahmet Altay wrote: > >> +1 (binding) >> >> On Fri, Sep 14, 2018 at 2:35 PM, Lukasz Cwik wrote: >> >>> +1 (binding) >>> >>> On Fri, Sep 14, 2018 at 2:34 PM Pablo Estrada >>> wrote: >>> +1

Re: [VOTE] Donating the Dataflow Worker code to Apache Beam

2018-09-14 Thread Henning Rohde
+1 On Fri, Sep 14, 2018 at 2:40 PM Ahmet Altay wrote: > +1 (binding) > > On Fri, Sep 14, 2018 at 2:35 PM, Lukasz Cwik wrote: > >> +1 (binding) >> >> On Fri, Sep 14, 2018 at 2:34 PM Pablo Estrada wrote: >> >>> +1 >>> >>> On Fri, Sep 14, 2018 at 2:32 PM Andrew Pilloud >>> wrote: >>> +1 >>>

Re: [VOTE] Donating the Dataflow Worker code to Apache Beam

2018-09-14 Thread Ahmet Altay
+1 (binding) On Fri, Sep 14, 2018 at 2:35 PM, Lukasz Cwik wrote: > +1 (binding) > > On Fri, Sep 14, 2018 at 2:34 PM Pablo Estrada wrote: > >> +1 >> >> On Fri, Sep 14, 2018 at 2:32 PM Andrew Pilloud >> wrote: >> >>> +1 >>> >>> On Fri, Sep 14, 2018 at 2:31 PM Lukasz Cwik wrote: >>> There w

Re: [VOTE] Donating the Dataflow Worker code to Apache Beam

2018-09-14 Thread Lukasz Cwik
+1 (binding) On Fri, Sep 14, 2018 at 2:34 PM Pablo Estrada wrote: > +1 > > On Fri, Sep 14, 2018 at 2:32 PM Andrew Pilloud > wrote: > >> +1 >> >> On Fri, Sep 14, 2018 at 2:31 PM Lukasz Cwik wrote: >> >>> There was generally positive support and good feedback[1] but it was not >>> unanimous. I w

Re: [VOTE] Donating the Dataflow Worker code to Apache Beam

2018-09-14 Thread Pablo Estrada
+1 On Fri, Sep 14, 2018 at 2:32 PM Andrew Pilloud wrote: > +1 > > On Fri, Sep 14, 2018 at 2:31 PM Lukasz Cwik wrote: > >> There was generally positive support and good feedback[1] but it was not >> unanimous. I wanted to bring the donation of the Dataflow worker code base >> to Apache Beam mast

Re: [VOTE] Donating the Dataflow Worker code to Apache Beam

2018-09-14 Thread Andrew Pilloud
+1 On Fri, Sep 14, 2018 at 2:31 PM Lukasz Cwik wrote: > There was generally positive support and good feedback[1] but it was not > unanimous. I wanted to bring the donation of the Dataflow worker code base > to Apache Beam master to a vote. > > +1: Support having the Dataflow worker code as part

[VOTE] Donating the Dataflow Worker code to Apache Beam

2018-09-14 Thread Lukasz Cwik
There was generally positive support and good feedback[1] but it was not unanimous. I wanted to bring the donation of the Dataflow worker code base to Apache Beam master to a vote. +1: Support having the Dataflow worker code as part of Apache Beam master branch -1: Dataflow worker code should live

Re: [VOTE] Release 2.7.0, release candidate #1

2018-09-14 Thread Raghu Angadi
I would like to propose one more cherry-pick for RC2: https://github.com/apache/beam/pull/6391 This is a KafkaIO bug fix. Once a user hits this bug, there is no easy workaround for them, especially on Dataflow. The only workaround in Dataflow is to restart or reload the job. The fix itself is fairly safe

Re: PTransforms and Fusion

2018-09-14 Thread Lukasz Cwik
Robert, in my mind making the shared libraries extensible for internal usage would be B2. Exposing those native PTransforms within the community would be B3. It seems as though we can achieve both A1 and B2 by making the shared library extensible. I have this commit[1] which starts to allow the sh

Re: [Discuss] Add EXTERNAL keyword to CREATE TABLE statement

2018-09-14 Thread Anton Kedin
Raising this topic once more. The PR[1] has been open for a while; if there is no further input, I'm going to merge it by end of day. [1]: https://github.com/apache/beam/pull/6252 Thank you, Anton On Wed, Aug 15, 2018 at 10:48 PM Tim wrote: > +1 for CREATE EXTERNAL TABLE with similar reasonin

Re: Python performance tests have been broken for a while. [BEAM-5334]

2018-09-14 Thread Mark Liu
I had some updates in another email thread. Basically, I'm trying to move python benchmarks to Gradle to avoid problematic environment setup. https://github.com/apache/beam/pull/6392 contains the changes. Mark On Fri, Sep 14, 2018 at 10:38 AM Pablo Estrada wrote: > I believe Mark Liu and Lukazs

[Proposal] Add a static PTransform.compose() method for composing transforms in a lambda expression

2018-09-14 Thread Jeff Klukas
Hello all, I'm a data engineer at Mozilla working on a first project using Beam. I've been impressed with the usability of the API as there are good built-in solutions for handling many simple transformation cases with minimal code, and wanted to discuss one bit of ergonomics that seems to be missi

Re: Python performance tests have been broken for a while. [BEAM-5334]

2018-09-14 Thread Pablo Estrada
I believe Mark Liu and Lukazs Gajowy had done some research into this. If I understand correctly, part of the problem is due to dependency conflicts between the PerfKit / Beam / Jenkins images, and it may be solved by containerizing the test environment (otherwise, it is pretty difficult to fix, I believe). I be

Python performance tests have been broken for a while. [BEAM-5334]

2018-09-14 Thread Huygaa Batsaikhan
Hi devs, regarding BEAM-5334, beam_PerformanceTests_Python has been broken for a really long time. Is anyone interested in picking it up? Here is the history of the test. Thanks

Re: SparkRunner - GroupByKey

2018-09-14 Thread Robert Bradshaw
On Fri, Sep 14, 2018 at 4:22 PM David Morávek wrote: > Hello Robert, > > thanks for the answer! Spark allows us to sort the single partition (after > repartition), by user provided comparator, so it is definitely possible to > do secondary sort by timestamp. The "more intelligent ReduceFnRunner"

Re: SparkRunner - GroupByKey

2018-09-14 Thread David Morávek
Hello Robert, thanks for the answer! Spark allows us to sort a single partition (after repartition) by a user-provided comparator, so it is definitely possible to do a secondary sort by timestamp. The "more intelligent ReduceFnRunner" you are talking about, is it part of the Beam codebase already (I gu
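The secondary sort described in this message can be sketched in plain Python. This is only an illustration of the idea, not Spark or Beam API: the function name `sorted_groups` and the `(key, value, timestamp)` record layout are assumptions made for the example. The partition is sorted by key and then by timestamp, after which each key's values can be read as one contiguous, timestamp-ordered run.

```python
from itertools import groupby
from operator import itemgetter


def sorted_groups(partition):
    """Yield (key, [(timestamp, value), ...]) with values in timestamp order.

    `partition` is an iterable of (key, value, timestamp) triples; this
    layout is hypothetical and chosen only for the sketch.
    """
    # Sort by key first, then by timestamp (the "secondary" sort field).
    ordered = sorted(partition, key=itemgetter(0, 2))
    # After sorting, all records of one key are adjacent, so groupby
    # can emit each group without building a per-key hash map.
    for key, run in groupby(ordered, key=itemgetter(0)):
        yield key, [(timestamp, value) for _, value, timestamp in run]


partition = [("a", "x", 7), ("b", "y", 2), ("a", "z", 1)]
result = list(sorted_groups(partition))
```

Because each group arrives as an ordered run, a consumer could process it incrementally instead of materializing every value for a key at once.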

Re: SparkRunner - GroupByKey

2018-09-14 Thread Robert Bradshaw
If Spark supports producing grouped elements in timestamp order, a more intelligent ReduceFnRunner can be used. (We take advantage of that in Dataflow for example.) For non-merging windows, you could also put the window itself (or some subset thereof) into the key resulting in smaller groupings. I
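The suggestion to put the window into the key for non-merging windows can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Beam or Spark code: fixed windows are assumed, and the helper names `fixed_window_start` and `group_by_key_and_window` are hypothetical.

```python
from collections import defaultdict


def fixed_window_start(timestamp, size):
    """Return the start of the fixed (non-merging) window holding `timestamp`."""
    return timestamp - (timestamp % size)


def group_by_key_and_window(elements, window_size):
    """Group (key, value, timestamp) triples by the composite (key, window start).

    Using the composite key splits one large per-key group into one
    smaller group per (key, window) pair.
    """
    groups = defaultdict(list)
    for key, value, timestamp in elements:
        composite_key = (key, fixed_window_start(timestamp, window_size))
        groups[composite_key].append(value)
    return dict(groups)


elements = [("a", 1, 0), ("a", 2, 5), ("a", 3, 12), ("b", 4, 3)]
grouped = group_by_key_and_window(elements, window_size=10)
```

This only works when an element's window can be computed from the element alone, which is why the message restricts the idea to non-merging windows.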

SparkRunner - GroupByKey

2018-09-14 Thread David Morávek
Hello, we are currently trying to move one of our large-scale batch jobs (~100TB inputs) from our Euphoria-based SparkRunner to Beam's Spark runner, and we came across the following issue. Because we rely on the Hadoop ecosystem, we need to group outputs by TaskAtt

Re: Discussion: Scheduling across runner and SDKHarness in Portability framework

2018-09-14 Thread Thomas Weise
That's actually how the Flink runner already works - bundle processing starts when elements are available (see FlinkExecutableStageFunction for batch mode). But we still have the possibility of the SDK getting concurrent requests due to parallelism (and pipelined execution). Thanks, Thomas On F

Re: Discussion: Scheduling across runner and SDKHarness in Portability framework

2018-09-14 Thread Robert Bradshaw
Currently the best solution we've come up with is that we must process an unbounded number of bundles concurrently to avoid deadlock. Especially in the batch case, this may be wasteful as we bring up workers for many stages that are not actually executable until upstream stages finish. Since it may

Re: Donating the Dataflow Worker code to Apache Beam

2018-09-14 Thread Robert Bradshaw
On Fri, Sep 14, 2018 at 10:02 AM Romain Manni-Bucau wrote: > > Le ven. 14 sept. 2018 à 09:48, Robert Bradshaw a > écrit : > >> On Fri, Sep 14, 2018 at 8:00 AM Romain Manni-Bucau >> wrote: >> >>> Well IBM runner is outside Beam for instance so this is not really a >>> point IMHO. >>> >>> My view

Re: Donating the Dataflow Worker code to Apache Beam

2018-09-14 Thread Romain Manni-Bucau
Le ven. 14 sept. 2018 à 09:48, Robert Bradshaw a écrit : > On Fri, Sep 14, 2018 at 8:00 AM Romain Manni-Bucau > wrote: > >> Well IBM runner is outside Beam for instance so this is not really a >> point IMHO. >> >> My view is simple: >> 1. does this module bring anything to Beam as a project: I u

Re: Donating the Dataflow Worker code to Apache Beam

2018-09-14 Thread Stephan Ewen
+1 (non googler) I think this is actually a nice move. Even if there is no immediate end-user benefit (no one can directly run it), it will probably be good and valuable code for other runners to learn and borrow from, so there is benefit for other developers. Plus, it eases the life of some othe

Re: [VOTE] Release 2.7.0, release candidate #1

2018-09-14 Thread Alexey Romanenko
Perhaps this could help: I ran a simple WordCount (built with Beam 2.7) on a YARN/Spark (HDP Sandbox) cluster and it worked fine for me. > On 14 Sep 2018, at 06:56, Romain Manni-Bucau wrote: > > Hi Charles, > > I didn't get enough time to check deeply but it is clearly a dependency issue > and

Re: Donating the Dataflow Worker code to Apache Beam

2018-09-14 Thread Robert Bradshaw
On Fri, Sep 14, 2018 at 8:00 AM Romain Manni-Bucau wrote: > Well IBM runner is outside Beam for instance so this is not really a point > IMHO. > > My view is simple: > 1. does this module bring anything to Beam as a project: I understand your > answer as a no (please clarify if I'm wrong) > As h