Re: [Discuss] Upgrade story for Beam's execution engines

2018-09-17 Thread Thomas Weise
It would be good to engage with the Flink community and attempt to find a
stable API that the runner can depend on.

Thanks,
Thomas




Re: [VOTE] Release 2.7.0, release candidate #1

2018-09-17 Thread Raghu Angadi
Thanks Charles. Sent a cherry-pick for the KafkaIO fix:
https://github.com/apache/beam/pull/6421


On Mon, Sep 17, 2018 at 10:18 AM Charles Chen  wrote:

> Luke, Maximilian, Raghu, can you please propose cherry-pick PRs to the
> release-2.7.0 for your issues and add me as a reviewer (@charlesccychen)?
>
> Romain, JB: is there any way I can help with debugging the issue you're
> facing so we can unblock the release?
>

Re: [VOTE] Donating the Dataflow Worker code to Apache Beam

2018-09-17 Thread Lukasz Cwik
Thanks all, closing the vote with 18 +1s, 5 of which are binding.

I'll try to get this code out and hopefully won't hit any legal issues
within Google or with the ASF in performing the donation. I will keep the
community up to date.

On Mon, Sep 17, 2018 at 3:28 PM Ankur Chauhan  wrote:

> +1
>
> Sent from my iPhone
>
> On Sep 17, 2018, at 15:26, Ankur Goenka  wrote:
>
> +1
>
> On Sun, Sep 16, 2018 at 3:20 AM Maximilian Michels  wrote:
>
>> +1 (binding)
>>
>> On 15.09.18 20:07, Reuven Lax wrote:
>> > +1
>> >
>> > On Sat, Sep 15, 2018 at 9:40 AM Rui Wang wrote:
>> >
>> > +1
>> >
>> > -Rui
>> >
>> > On Sat, Sep 15, 2018 at 12:32 AM Robert Bradshaw wrote:
>> >
>> > +1 (binding)
>> >
>> > On Sat, Sep 15, 2018 at 6:44 AM Tim wrote:
>> >
>> > +1
>> >
>> > On 15 Sep 2018, at 01:23, Yifan Zou wrote:
>> >
>> > +1
>> >
>> > On Fri, Sep 14, 2018 at 4:20 PM David Morávek wrote:
>> >
>> > +1
>> >
>> > On 15 Sep 2018, at 00:59, Anton Kedin wrote:
>> >
>> > +1
>> >
>> > On Fri, Sep 14, 2018 at 3:22 PM Alan Myrvold wrote:
>> >
>> > +1
>> >
>> > On Fri, Sep 14, 2018 at 3:16 PM Boyuan Zhang wrote:
>> >
>> > +1
>> >
>> > On Fri, Sep 14, 2018 at 3:15 PM Henning Rohde wrote:
>> >
>> > +1
>> >
>> > On Fri, Sep 14, 2018 at 2:40 PM Ahmet Altay wrote:
>> >
>> > +1 (binding)
>> >
>> > On Fri, Sep 14, 2018 at 2:35 PM Lukasz Cwik wrote:
>> >
>> > +1 (binding)
>> >
>> > On Fri, Sep 14, 2018 at 2:34 PM Pablo Estrada wrote:
>> >
>> > +1
>> >
>> > On Fri, Sep 14, 2018 at 2:32 PM Andrew Pilloud wrote:
>> >
>> > +1
>> >
>> > On Fri, Sep 14, 2018 at 2:31 PM Lukasz Cwik wrote:
>> >
>> > There was generally positive support and good feedback[1] but it was not
>> > unanimous. I wanted to bring the donation of the Dataflow worker code
>> > base to Apache Beam master to a vote.
>> >
>> > +1: Support having the Dataflow worker code as part of the Apache Beam
>> > master branch
>> > -1: Dataflow worker code should live elsewhere
>> >
>> > 1:
>> > https://lists.apache.org/thread.html/89efd3bc1d30f3d43d4b361a5ee05bd52778c9dc3f43ac72354c2bd9@%3Cdev.beam.apache.org%3E


Re: [VOTE] Donating the Dataflow Worker code to Apache Beam

2018-09-17 Thread Thomas Weise
+1 (binding)

On Mon, Sep 17, 2018 at 3:27 PM Ankur Goenka  wrote:

> +1
>
> On Sun, Sep 16, 2018 at 3:20 AM Maximilian Michels  wrote:
>
>> +1 (binding)


Re: [VOTE] Donating the Dataflow Worker code to Apache Beam

2018-09-17 Thread Ankur Chauhan
+1

Sent from my iPhone

> On Sep 17, 2018, at 15:26, Ankur Goenka  wrote:
> 
> +1
> 
>> On Sun, Sep 16, 2018 at 3:20 AM Maximilian Michels  wrote:
>> +1 (binding)


Re: [VOTE] Donating the Dataflow Worker code to Apache Beam

2018-09-17 Thread Ankur Goenka
+1

On Sun, Sep 16, 2018 at 3:20 AM Maximilian Michels  wrote:

> +1 (binding)


Re: Migrating Beam SQL to Calcite's code generation

2018-09-17 Thread Mingmin Xu
Awesome work, we should call Calcite operator functions if available.

I haven't had time to read the PR yet; for the operators impacted, would we
keep the existing implementation? One example: I recently noticed that
FLOOR/CEIL only supports months/years, which was quite a surprise to me.
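To make that concrete, here is a hedged illustration using Beam SQL; the
`input` collection and its TIMESTAMP field `ts` are assumptions for the
example, not code from the PR:

import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

class FloorExample {
  // `input` is assumed to have a schema with a TIMESTAMP field `ts`.
  static PCollection<Row> floorToMonth(PCollection<Row> input) {
    // Month/year granularity works with the Calcite-generated operators:
    return input.apply(
        SqlTransform.query("SELECT FLOOR(ts TO MONTH) AS m FROM PCOLLECTION"));
    // Finer granularities reportedly fail after the migration, e.g.:
    // SqlTransform.query("SELECT FLOOR(ts TO MINUTE) AS m FROM PCOLLECTION")
  }
}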

Mingmin

On Mon, Sep 17, 2018 at 3:03 PM Anton Kedin  wrote:

> This is pretty amazing! Thank you for doing this!
>
> Regards,
> Anton
>

-- 

Mingmin


Re: Migrating Beam SQL to Calcite's code generation

2018-09-17 Thread Anton Kedin
This is pretty amazing! Thank you for doing this!

Regards,
Anton



Migrating Beam SQL to Calcite's code generation

2018-09-17 Thread Andrew Pilloud
I've adapted Calcite's EnumerableCalc code generation to generate the
BeamCalc DoFn. The primary purpose behind this change is so we can take
advantage of Calcite's extensive SQL operator implementation. This deletes
~11000 lines of code from Beam (with ~350 added), significantly increases
the set of supported SQL operators, and improves performance and
correctness of currently supported operators. Here is my work in progress:
https://github.com/apache/beam/pull/6417

There are a few bugs in Calcite that this has exposed:

Fixed in Calcite master:

   - CALCITE-2321 - The type of a union of CHAR columns of different lengths
     should be VARCHAR
   - CALCITE-2447 - Some POWER, ATAN2 functions fail with NoSuchMethodException

Pending PRs:

   - CALCITE-2529 - linq4j should promote integer to floating point when
     generating function calls
   - CALCITE-2530 - TRIM function does not throw an exception when the length
     of the trim character is not 1 (one)

More work:

   - CALCITE-2404 - Accessing structured types is not implemented by the
     runtime
   - (none yet) - Support multi-character TRIM extension in Calcite

I would like to push these changes in with these minor regressions. Do any
of these Calcite bugs block this functionality being added to Beam?

Andrew
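
Roughly, the shape of the change looks like the sketch below; all names here
are hypothetical, not the PR's actual code. Calcite generates the
projection/filter body once, and the DoFn invokes the compiled result per
element (a real implementation must also handle serializing or regenerating
the generated code on workers):

import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.Row;

// Hypothetical handle to what Calcite's code generator emits for a Calc
// node (projection + filter); a null result means the row was filtered out.
interface CompiledCalc {
  Object[] eval(Object[] input);
}

class BeamCalcDoFn extends DoFn<Row, Row> {
  private final CompiledCalc calc;
  private final Schema outputSchema;

  BeamCalcDoFn(CompiledCalc calc, Schema outputSchema) {
    this.calc = calc;
    this.outputSchema = outputSchema;
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    // Evaluate the generated projection/filter against the input row's values.
    Object[] result = calc.eval(c.element().getValues().toArray());
    if (result != null) {
      c.output(Row.withSchema(outputSchema).addValues(result).build());
    }
  }
}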


Re: [Proposal] Add a static PTransform.compose() method for composing transforms in a lambda expression

2018-09-17 Thread Jeff Klukas
I've gone ahead and filed a JIRA issue and a GitHub PR to follow up on this
suggestion and make it more concrete:

https://issues.apache.org/jira/browse/BEAM-5413
https://github.com/apache/beam/pull/6414

On Fri, Sep 14, 2018 at 1:42 PM Jeff Klukas  wrote:

> Hello all, I'm a data engineer at Mozilla working on a first project using
> Beam. I've been impressed with the usability of the API as there are good
> built-in solutions for handling many simple transformation cases with
> minimal code, and wanted to discuss one bit of ergonomics that seems to be
> missing.
>
> It appears that none of the existing PTransform factories are generic
> enough to take in or output a PCollectionTuple, but we've found many use
> cases where it's convenient to apply a few transforms on a PCollectionTuple
> in a lambda expression.
>
> For example, we've defined several PTransforms that return main and error
> output streams bundled in a PCollectionTuple. We defined a
> CompositeTransform interface so that we could handle the error output in a
> lambda expression like:
>
> pipeline
>     .apply("attempt to deserialize messages", new MyDeserializationTransform())
>     .apply("write deserialization errors",
>         CompositeTransform.of((PCollectionTuple input) -> {
>           input.get(errorTag).apply(new MyErrorOutputTransform());
>           return input.get(mainTag);
>         }))
>     .apply("more processing on the deserialized messages", new MyOtherTransform());
>
> I'd be interested in contributing a patch to add this functionality,
> perhaps as a static method PTransform.compose(). Would that patch be
> welcome? Are there other thoughts on naming?
>
> The full code of the CompositeTransform interface we're currently using is
> included below.
>
>
> public interface CompositeTransform<InputT extends PInput, OutputT extends POutput> {
>   OutputT expand(InputT input);
>
>   /**
>    * The public factory method that serves as the entrypoint for users to
>    * create a composite PTransform.
>    */
>   static <InputT extends PInput, OutputT extends POutput>
>       PTransform<InputT, OutputT> of(CompositeTransform<InputT, OutputT> transform) {
>     return new PTransform<InputT, OutputT>() {
>       @Override
>       public OutputT expand(InputT input) {
>         return transform.expand(input);
>       }
>     };
>   }
> }
>
>
>
>
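
For reference, a minimal sketch of what the proposed static factory could
look like, assuming a SerializableFunction-based signature; the actual API
is whatever lands in the linked PR:

import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.PInput;
import org.apache.beam.sdk.values.POutput;

// Sketch only; the signature is an assumption, not the merged API.
class PTransforms {
  static <InputT extends PInput, OutputT extends POutput>
      PTransform<InputT, OutputT> compose(SerializableFunction<InputT, OutputT> fn) {
    // Wrap the lambda in an anonymous PTransform whose expand() delegates to it.
    return new PTransform<InputT, OutputT>() {
      @Override
      public OutputT expand(InputT input) {
        return fn.apply(input);
      }
    };
  }
}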


Re: SplittableDoFn

2018-09-17 Thread Lukasz Cwik
The changes to the API have not been proposed yet. So far it has all been
about what the representation is and why.

For splitting, the current idea has been to use the backlog as a way of
telling the SplittableDoFn where to split, so it would be in terms of
whatever the SDK decided to report.
The runner always chooses a number for backlog that is relative to the SDK's
reported backlog. It would be up to the SDK to round/clamp the number given
by the Runner to represent something meaningful for itself.
For example, if the backlog that the SDK was reporting was bytes remaining
in a file, such as 500, then the Runner could provide some value like 212.2,
which the SDK would then round to 212.
If the backlog that the SDK was reporting was 57 Pub/Sub messages, then the
Runner could provide a value like 300, which would mean read the 57 values
and then another 243 as part of the current restriction.
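
To make the rounding/clamping concrete, here is a minimal sketch, assuming a
file-based restriction whose backlog is bytes remaining. The class and method
names are illustrative, not a proposed Beam API:

import java.math.BigDecimal;
import java.math.RoundingMode;

// Illustrative sketch only: how an SDK might interpret a runner-proposed
// backlog split for a byte-based restriction.
class FileBacklogTracker {
  private long currentOffset;   // bytes already consumed
  private final long endOffset; // exclusive end of the restriction

  FileBacklogTracker(long currentOffset, long endOffset) {
    this.currentOffset = currentOffset;
    this.endOffset = endOffset;
  }

  // Reported to the runner, e.g. 500 bytes remaining.
  BigDecimal getBacklog() {
    return BigDecimal.valueOf(endOffset - currentOffset);
  }

  // The runner proposes keeping e.g. 212.2 of the remaining backlog; the
  // SDK rounds it to whole bytes and clamps it into the valid range.
  long splitPosition(BigDecimal proposedBacklog) {
    long keep = proposedBacklog.setScale(0, RoundingMode.HALF_UP).longValue();
    keep = Math.max(0, Math.min(keep, endOffset - currentOffset));
    return currentOffset + keep; // new exclusive end of the primary part
  }
}

The same pattern would apply to count-based backlogs such as the Pub/Sub
example, with messages in place of bytes.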

I believe that BoundedSource/UnboundedSource will have wrappers added that
provide a basic SplittableDoFn implementation, so existing IOs should be
migrated over without API changes.

On Mon, Sep 17, 2018 at 1:11 AM Ismaël Mejía  wrote:

> Thanks a lot Luke for bringing this back to the mailing list, and Ryan for
> taking the notes.
>
> I would like to know if there was some discussion, or if you guys have
> given some thought to the required changes in the SDK (API) part. What
> will be the equivalent of `splitAtFraction`, and what should IO authors
> do to support it?
>
> On Sat, Sep 15, 2018 at 1:52 AM Lukasz Cwik  wrote:
> >
> > Thanks to everyone who joined and for the questions asked.
> >
> > Ryan graciously collected notes of the discussion:
> https://docs.google.com/document/d/1kjJLGIiNAGvDiUCMEtQbw8tyOXESvwGeGZLL-0M06fQ/edit?usp=sharing
> >
> > The summary was that we should bring BoundedSource/UnboundedSource into
> > using a unified backlog-reporting mechanism, with optional other signals
> > that Dataflow has found useful (such as whether the remaining restriction
> > is splittable: yes, no, unknown). Other runners can use these or not. SDFs
> > should report backlog and watermark as the minimum bar. The backlog should
> > use an arbitrary-precision float such as Java's BigDecimal to prevent
> > issues where limited precision removes the ability to compute deltas
> > efficiently.
> >
> >
> >
> > On Wed, Sep 12, 2018 at 3:54 PM Lukasz Cwik  wrote:
> >>
> >> Here is the link to join the discussion:
> https://meet.google.com/idc-japs-hwf
> >> Remember that it is this Friday Sept 14th from 11am-noon PST.
> >>
> >>
> >>
> >> On Mon, Sep 10, 2018 at 7:30 AM Maximilian Michels 
> wrote:
> >>>
> >>> Thanks for moving forward with this, Lukasz!
> >>>
> >>> Unfortunately, can't make it on Friday but I'll sync with somebody on
> >>> the call (e.g. Ryan) about your discussion.
> >>>
> >>> On 08.09.18 02:00, Lukasz Cwik wrote:
> >>> > Thanks to everyone who wanted to fill out the doodle poll. The most
> >>> > popular time was Friday Sept 14th from 11am-noon PST. I'll send out a
> >>> > calendar invite and meeting link early next week.
> >>> >
> >>> > I have received a lot of feedback on the document and have addressed
> >>> > some parts of it including:
> >>> > * clarifying terminology
> >>> > * processing skew due to some restrictions having their watermarks much
> >>> > further behind than others, affecting scheduling of bundles by runners
> >>> > * external throttling & I/O wait overhead reporting to make sure we
> >>> > don't overscale
> >>> >
> >>> > Areas that still need additional feedback and details are:
> >>> > * reporting progress around the work that is done and is active
> >>> > * more examples
> >>> > * unbounded restrictions being caused by an unbounded number of splits
> >>> > of existing unbounded restrictions (infinite work growth)
> >>> > * whether we should be reporting this information at the PTransform
> >>> > level or at the bundle level
> >>> >
> >>> >
> >>> >
> >>> > On Wed, Sep 5, 2018 at 1:53 PM Lukasz Cwik wrote:
> >>> >
> >>> > Thanks to all those who have provided interest in this topic by the
> >>> > questions they have asked on the doc already and for those interested
> >>> > in having this discussion. I have set up this doodle to allow people
> >>> > to provide their availability:
> >>> > https://doodle.com/poll/nrw7w84255xnfwqy
> >>> >
> >>> > I'll send out the chosen time based upon people's availability and a
> >>> > Hangout link by end of day Friday, so please mark your availability
> >>> > using the link above.
> >>> >
> >>> > The agenda of the meeting will be as follows:
> >>> > * Overview of the proposal
> >>> > * Enumerate and discuss/answer questions brought up in the meeting
> >>> >
> >>> > Note that all questions and any discussions/answers provided will be
> >>> > added to the doc for those who are unable to attend.
> >>> >
> >>> > On Fri, Aug 31, 2018 at 9:47 AM Jean-Baptiste Onofré
> >>> > 

Re: [VOTE] Release 2.7.0, release candidate #1

2018-09-17 Thread Romain Manni-Bucau
@Charles: just "mvn verify" (it is an integration-test)

Romain Manni-Bucau | @rmannibucau | Blog | Old Blog | Github | LinkedIn | Book


On Mon, Sep 17, 2018 at 19:54, Charles Chen wrote:

> Can you send the specific command / config you are using to reproduce?
>

Re: [Help wanted] Fixing beam_PerformanceTests_Python

2018-09-17 Thread Mark Liu
I updated the Jira.

On Mon, Sep 17, 2018 at 10:58 AM Huygaa Batsaikhan 
wrote:

> Thanks, please keep the bug updated.
>
> On Mon, Sep 17, 2018 at 10:56 AM Ahmet Altay  wrote:
>
>> I talked with Mark. His PR (#6392) might help. He also had a few more
>> ideas for debugging. If it does not work, I will work with Mark to resolve
>> this.
>>
>> On Mon, Sep 17, 2018 at 10:45 AM, Huygaa Batsaikhan 
>> wrote:
>>
>>>
>>>
>>> On Mon, Sep 17, 2018 at 10:42 AM Huygaa Batsaikhan 
>>> wrote:
>>>
 Hi devs, Python performance tests have been failing for a while due to
 incompatible packages in dependencies. Could anyone fix this issue? Here is
 the Jira link.
 Thanks.

 Failure link:
 https://builds.apache.org/job/beam_PerformanceTests_Python/1436/console

>>
>>


Re: [VOTE] Release 2.7.0, release candidate #1

2018-09-17 Thread Lukasz Cwik
I created this cherry-pick for the missing pom.xml issue:
https://github.com/apache/beam/pull/6412

On Mon, Sep 17, 2018 at 10:54 AM Charles Chen  wrote:

> Can you send the specific command / config you are using to reproduce?
>

Re: [Help wanted] Fixing beam_PerformanceTests_Python

2018-09-17 Thread Huygaa Batsaikhan
Thanks, please keep the bug updated.

On Mon, Sep 17, 2018 at 10:56 AM Ahmet Altay  wrote:

> I talked with Mark. His PR (#6392) might help. He also had a few more ideas
> for debugging. If it does not work, I will work with Mark to resolve this.
>
> On Mon, Sep 17, 2018 at 10:45 AM, Huygaa Batsaikhan 
> wrote:
>
>>
>>
>> On Mon, Sep 17, 2018 at 10:42 AM Huygaa Batsaikhan 
>> wrote:
>>
>>> Hi devs, Python performance tests have been failing for a while due to
>>> incompatible packages in dependencies. Could anyone fix this issue? Here is
>>> the Jira link.
>>> Thanks.
>>>
>>> Failure link:
>>> https://builds.apache.org/job/beam_PerformanceTests_Python/1436/console
>>>
>>
>
>


Re: [Help wanted] Fixing beam_PerformanceTests_Python

2018-09-17 Thread Ahmet Altay
I talked with Mark. His PR (#6392) might help. He also had a few more ideas
for debugging. If it does not work, I will work with Mark to resolve this.

On Mon, Sep 17, 2018 at 10:45 AM, Huygaa Batsaikhan 
wrote:

>
>
> On Mon, Sep 17, 2018 at 10:42 AM Huygaa Batsaikhan 
> wrote:
>
>> Hi devs, Python performance tests have been failing for a while due to
>> incompatible packages in dependencies. Could anyone fix this issue? Here is
>> the Jira link.
>> Thanks.
>>
>> Failure link:
>> https://builds.apache.org/job/beam_PerformanceTests_Python/1436/console
>>
>



[Help wanted] Fixing beam_PerformanceTests_Python

2018-09-17 Thread Huygaa Batsaikhan
Hi devs, Python performance tests have been failing for a while due to
incompatible packages in dependencies. Could anyone fix this issue? Here is
the Jira link.
Thanks.

Failure link:
https://builds.apache.org/job/beam_PerformanceTests_Python/1436/console


Re: [VOTE] Release 2.7.0, release candidate #1

2018-09-17 Thread Romain Manni-Bucau
@Charles: I guess you can download the
https://github.com/Talend/component-runtime/tree/master/component-runtime-beam/src/it/serialization-over-cluster
subproject, replace project.version with 1.0.4 (the other placeholders are
more widely known/documented on the net), and you should be able to reproduce
by forcing Beam to v2.7.0. I don't have much time this week to check out this
particular issue, but hopefully next week should be more doable if the issue
is still pending.

Romain Manni-Bucau | @rmannibucau | Blog | Old Blog | Github | LinkedIn | Book


On Mon, Sep 17, 2018 at 19:18, Charles Chen wrote:

> Luke, Maximilian, Raghu, can you please propose cherry-pick PRs to the
> release-2.7.0 for your issues and add me as a reviewer (@charlesccychen)?
>
> Romain, JB: is there any way I can help with debugging the issue you're
> facing so we can unblock the release?
>

Re: [VOTE] Release 2.7.0, release candidate #1

2018-09-17 Thread Charles Chen
Luke, Maximilian, Raghu, can you please propose cherry-pick PRs to the
release-2.7.0 for your issues and add me as a reviewer (@charlesccychen)?

Romain, JB: is there any way I can help with debugging the issue you're
facing so we can unblock the release?

On Fri, Sep 14, 2018 at 1:49 PM Raghu Angadi  wrote:

> I would like to propose one more cherry-pick for RC2:
> https://github.com/apache/beam/pull/6391
> This is a KafkaIO bug fix. Once a user hits this bug, there is no easy
> workaround for them, especially on Dataflow; the only workaround there
> is to restart or reload the job.
>
> The fix itself is fairly safe and is tested.
> Raghu.
>
> On Fri, Sep 14, 2018 at 12:52 AM Alexey Romanenko <
> aromanenko@gmail.com> wrote:
>
>> Perhaps it could help, but I ran a simple WordCount (built with Beam 2.7)
>> on a YARN/Spark (HDP Sandbox) cluster and it worked fine for me.
>>
>> On 14 Sep 2018, at 06:56, Romain Manni-Bucau 
>> wrote:
>>
>> Hi Charles,
>>
>> I didn't get enough time to check deeply, but it is clearly a dependency
>> issue, and it is not in the Beam Spark runner itself but in another
>> transitive module of Beam. It does not happen in the existing Spark tests
>> because none of them run in a cluster (even with just 1 worker), but this
>> seems to be a regression since 2.6 works OOTB.
>>
>> Romain Manni-Bucau | @rmannibucau | Blog | Old Blog | Github | LinkedIn | Book
>>
>>
>> On Thu, Sep 13, 2018 at 22:15, Charles Chen wrote:
>>
>>> Romain and JB, can you please add the results of your investigations
>>> into the errors you've seen above?  Given that the existing SparkRunner
>>> tests pass for this RC, and that the integration test you ran is in another
>>> repo that is not continuously tested with Beam, it is not clear how we
>>> should move forward and whether this is a blocking issue, unless we can
>>> find a root cause in Beam.
>>>
>>> On Wed, Sep 12, 2018 at 2:08 AM Etienne Chauchot 
>>> wrote:
>>>
 Hi all,

 From a performance and functional regression standpoint, I see no
 regression:

 I looked at nexmark graphs "output pcollection size" and "execution
 time" around the release cut date on Dataflow, Spark, Flink and the direct
 runner in batch and streaming modes. There seems to be no regression.

 Etienne

 On Tuesday, September 11, 2018 at 12:25 -0700, Charles Chen wrote:

 The SparkRunner validation test (here:
 https://beam.apache.org/contribute/release-guide/#run-validation-tests)
 passes on my machine.  It looks like we are likely missing test coverage
 where Romain is hitting issues.

 On Tue, Sep 11, 2018 at 12:15 PM Ahmet Altay  wrote:

 Could anyone else help with looking at these issues earlier?

 On Tue, Sep 11, 2018 at 12:03 PM, Romain Manni-Bucau <
 rmannibu...@gmail.com> wrote:

 I'm running this main [1] through this IT [2]. It was working fine for ~1
 year, but 2.7.0 broke it. Didn't investigate more, but I can have a look
 later this month if it helps.

 [1]
 https://github.com/Talend/component-runtime/blob/master/component-runtime-beam/src/it/serialization-over-cluster/src/main/java/org/talend/sdk/component/beam/it/clusterserialization/Main.java
 [2]
 https://github.com/Talend/component-runtime/blob/master/component-runtime-beam/src/it/serialization-over-cluster/src/test/java/org/talend/sdk/component/beam/it/SerializationOverClusterIT.java

 On Tue, Sep 11, 2018 at 20:54, Charles Chen wrote:

 Romain: can you give more details on the failure you're encountering,
 i.e. how you are performing this validation?

 On Tue, Sep 11, 2018 at 9:36 AM Jean-Baptiste Onofré 
 wrote:

 Hi,

 weird, I didn't have it on Beam samples. Let me try to reproduce and I
 will create the Jira.

 Regards
 JB

 On 11/09/2018 11:44, Romain Manni-Bucau wrote:
 > -1, seems Spark integration is broken (tested with Spark 2.3.1 and 2.2.1):
 >
 > 18/09/11 11:33:29 WARN TaskSetManager: Lost task 0.0 in stage 0.0
 > (TID 0, RMANNIBUCAU, executor 0): java.lang.ClassCastException: cannot
 > assign instance of scala.collection.immutable.List$SerializationProxy to
 > field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_
 > of type scala.collection.Seq in instance of
 > org.apache.spark.rdd.MapPartitionsRDD
 >   at
 > java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2233)
 >
 >
 > Also the issue Lukasz identified is important even 

Re: Beam dashboards

2018-09-17 Thread Mikhail Gryzykhin
Thank you for the feedback.

I had the idea of adding a list of tests with durations. Unfortunately, I'm
not clear on how to represent such a metric. One of the ideas I had is to
show a graph with the top 10 slowest tests for each job run; however, we
have too many different test jobs. I'm open to brainstorming ideas in this
area.

Meanwhile I'll play around with different options and see what I can come
up with.

Regards,
--Mikhail

Have feedback?


On Sun, Sep 16, 2018 at 4:04 AM Maximilian Michels  wrote:

> Thanks Mikhail, that will help to identify flaky or slow tests. At the
> size of the Beam code base, such statistics are extremely helpful.
>
> If we had a list of test cases ordered by test duration, that would be a
> great addition.
>
> On 14.09.18 00:30, Connell O'Callaghan wrote:
> > Thank you Mikhail for sharing this and to everyone involved in these
> > improvements!!! It will be great to hear about progress and any blockers
> > encountered with this work.
> >
> > On Thu, Sep 13, 2018 at 11:12 AM Mikhail Gryzykhin wrote:
> >
> > Hi everyone,
> >
> > Huygaa and I are working on creating dashboards for Beam.
> >
> > So far I created a POC dashboard for post-commit greenness that includes
> > a list of the latest failed jobs, and a dashboard for pre-commit duration;
> > that one contains only duration so far.
> >
> > We have work ongoing for adding GitHub statistics.
> >
> > Please check those out. Any feedback is welcome.
> >
> > Small hints:
> > * You can change the time range you're looking at in the top-right corner.
> > * You can choose a dashboard by clicking the dashboard name in the
> > top-left corner.
> >
> > As a short insight:
> > Thank you to everyone who helps fix post-commit test flakes. We
> > moved from ~30% to 75-85% successful runs.
> >
> > Best regards,
> > --Mikhail
> >
> > Have feedback?
> >
>


Re: [Discuss] Upgrade story for Beam's execution engines

2018-09-17 Thread Maximilian Michels

[Copying this also to the dev list]

+1. A version compatibility table would be great!

> I don't know if Flink could do something like this (become a provided
> dep), in particular for the current case where there seem to be no
> API-breaking changes.

That doesn't work. The Flink Runner is too tightly integrated with Flink
internals, and these internals are not always optimally decoupled. This
already fails on the client side, e.g. when submitting a Flink job via
Beam to a Flink cluster, though it should be better now with the new
REST-based clients.


On 17.09.18 09:48, Ismaël Mejía wrote:

In the Spark runner, the user provides the core Spark dependencies at runtime,
and we assume that backwards compatibility is kept (in upstream Spark). We
support the whole 2.x line, but we try to keep the version close to the latest
stable release.

Notice, however, that we lack tests to validate that all versions work. I
remember some issues with metrics during the migration to Spark 2.x with older
versions of Spark (<= 2.1); those worked flawlessly with more recent versions.

I don't know if Flink could do something like this (become a provided
dep), in particular for the current case where there seem to be no
API-breaking changes.

In any case, +1 to getting our act together a bit on this.

On Mon, Sep 17, 2018 at 9:31 AM Robert Bradshaw  wrote:


On Mon, Sep 17, 2018 at 2:02 AM Austin Bennett  
wrote:


Do we currently maintain a finer-grained list of compatibility between
execution-engine/runner versions and Beam versions? Is this only really a
concern with recent Flink (it sounded like at least the Spark jump, too)? I
see the capability matrix:
https://beam.apache.org/documentation/runners/capability-matrix/, but some
sort of compatibility mapping between runner versions and Beam releases might
be useful.

I see a compatibility matrix for Beam features, but not for the underlying
runners. Ex: something like this would save a user from trying to get Beam
working on recent Flink 1.6 and then subsequently hitting a (potentially not
well documented) wall given known issues.



+1. I was bitten by this as well.

I don't know if it's worth having a compatibility matrix for each version (as 
the overlap is likely to be all or nothing in most cases), but it should be 
prominently displayed here and elsewhere. Want to send out a PR?
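
As a strawman for what such a page could show (layout only; the version
pairings below are illustrative and would need verification before
publishing):

Beam release   Flink          Spark
2.6.0          1.5.x          2.x (user-provided at runtime)
2.7.0          1.5.x          2.x (user-provided at runtime)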


Re: [Discuss] Upgrade story for Beam's execution engines

2018-09-17 Thread Maximilian Michels
> To clarify, there are two classes of OldRunner users here: those that
> want new features in Beam, and those that simply want to run a supported
> version of Beam. The LTS proposal helps the latter, which is going to be
> biased towards those not upgrading their runners. The former are worth
> supporting as well.


Agree. Apart from the LTS, we want to minimize upgrade pain for new Beam 
versions as much as possible.


On 17.09.18 09:30, Robert Bradshaw wrote:
On Sun, Sep 16, 2018 at 12:59 PM Maximilian Michels wrote:


 > If I understand the LTS proposal correctly, then it will be a release
 > line that continues to receive patches (as in semantic versioning), but
 > no new features as that would defeat the purpose (stability).

It matters insofar as execution engine upgrades could be performed in
the master but the LTS version won't receive them. So LTS is the go-to
if you want to ensure compatibility with your existing setup.


To clarify, there are two classes of OldRunner users here: those that 
want new features in Beam, and those that simply want to run a supported 
version of Beam. The LTS proposal helps the latter, which is going to be 
biased towards those not upgrading their runners. The former are worth 
supporting as well.


 > To limit the pain of dealing with incompatible runner changes and
 > copies within Beam, we should probably also work with the respective
 > community to improve the compatibility story.

Absolutely. If we find that we can improve compatibility with upstream
changes, we should go that path. Even if we don't have a dedicated
compatibility layer upstream yet.

On 13.09.18 19:34, Thomas Weise wrote:
 >
 > On Thu, Sep 13, 2018 at 9:49 AM Maximilian Michels
mailto:m...@apache.org>
 > >> wrote:
 >
 >     Thank you for your comments. Let me try to summarize what has
been
 >     discussed so far:
 >
 >     1. The Beam LTS version will ensure a stable execution engine for
 >     as long as the LTS life span.
 >
 >
 > If I understand the LTS proposal correctly, then it will be a release
 > line that continues to receive patches (as in semantic versioning), but
 > no new features as that would defeat the purpose (stability).
 >
 > If so, then I don't think LTS matters for this discussion.
 >
 >     2. We agree that pushing updates to the execution engine for the
 >     Runners is only desirable if it results in a better integration
 >     with the Beam model or if it is necessary due to security or
 >     performance reasons.
 >
 >     3. We might have to consider adding additional build targets for a
 >     Runner whenever the execution engine gets upgraded. This might be
 >     really easy if the engine's API remains stable. It might also be
 >     desirable if the upgrade path is not easy and not completely
 >     foreseeable, e.g. Etienne mentioned the Spark 1.x vs Spark 2.x
 >     Runner. The Beam feature set could vary depending on the version.
 >
 >
 > To limit the pain of dealing with incompatible runner changes and
 > copies within Beam, we should probably also work with the respective
 > community to improve the compatibility story.
 >
 >
 >     4. In the long run, we want a stable abstraction layer for each
 >     Runner that, ideally, is maintained by the upstream of the
 >     execution engine. In the short run, this is probably not
 >     realistic, as the shared libraries of Beam are not stable enough.
 >
 >
 > Yes, that will only become an option once we reach interface
 > stability. Similar to how the runner projects maintain their IO
 > connectors.
 >
 >     On 13.09.18 14:39, Robert Bradshaw wrote:
 >      > The ideal long-term solution is, as Romain mentions, pushing
 >      > the runner-specific code up to be maintained by each runner
 >      > with a stable API to use to talk to Beam. Unfortunately, I
 >      > think we're still a long way from having this Stable API, or
 >      > having the clout for non-beam-developers to maintain these
 >      > bindings externally (though hopefully we'll get there).
 >      >
 >      > In the short term, we're stuck with either hurting users that
 >      > want to stick with Flink 1.5, hurting users that want to
 >      > upgrade to Flink 1.6, or supporting both. Is Beam's interaction
 >      > with Flink such that we can't simply have separate targets
 >      > linking the same Beam code against one or the other? (I.e. are
 >      > code changes needed?) If so, we'll probably need a
 >      > flink-runner-1.5 module, a flink-runner-1.6, and a
 >      > flink-runner-common module.

Beam Dependency Check Report (2018-09-17)

2018-09-17 Thread Apache Jenkins Server

High Priority Dependency Updates Of Beam Python SDK:

  Dependency Name | Current Version | Latest Version | Release Date Of the Current Used Version | Release Date Of The Latest Release | JIRA Issue
  google-cloud-bigquery | 0.25.0 | 1.5.0 | 2017-06-26 | 2017-06-26 | BEAM-5079
  google-cloud-pubsub | 0.26.0 | 0.38.0 | 2017-06-26 | 2017-06-26 | BEAM-5398
High Priority Dependency Updates Of Beam Java SDK:

  Dependency Name | Current Version | Latest Version | Release Date Of the Current Used Version | Release Date Of The Latest Release | JIRA Issue
  org.assertj:assertj-core | 2.5.0 | 3.11.1 | 2016-07-03 | 2018-08-28 | BEAM-5289
  com.google.auto.service:auto-service | 1.0-rc2 | 1.0-rc4 | 2014-10-25 | 2017-12-11 | BEAM-4874
  biz.aQute:bndlib | 1.43.0 | 2.0.0.20130123-133441 | 2011-04-01 | 2013-02-27 | BEAM-4884
  com.gradle:build-scan-plugin | 1.13.1 | 1.16 | 2018-04-10 | 2018-08-27 | BEAM-5224
  org.apache.cassandra:cassandra-all | 3.9 | 3.11.3 | 2016-09-26 | 2018-07-25 | BEAM-5083
  commons-cli:commons-cli | 1.2 | 1.4 | 2009-03-19 | 2017-03-09 | BEAM-4896
  commons-codec:commons-codec | 1.9 | 1.11 | 2013-12-21 | 2017-10-17 | BEAM-4898
  org.apache.commons:commons-dbcp2 | 2.1.1 | 2.5.0 | 2015-08-03 | 2018-07-13 | BEAM-4900
  com.typesafe:config | 1.3.0 | 1.3.3 | 2015-05-08 | 2018-02-21 | BEAM-4902
  de.flapdoodle.embed:de.flapdoodle.embed.mongo | 1.50.1 | 2.1.1 | 2015-12-11 | 2018-06-21 | BEAM-4904
  de.flapdoodle.embed:de.flapdoodle.embed.process | 1.50.1 | 2.0.5 | 2015-12-10 | 2018-06-21 | BEAM-4905
  org.elasticsearch:elasticsearch-hadoop | 5.0.0 | 6.4.0 | 2016-10-26 | 2018-08-18 | BEAM-5226
  net.ltgt.gradle:gradle-apt-plugin | 0.13 | 0.18 | 2017-11-01 | 2018-07-23 | BEAM-4924
  com.commercehub.gradle.plugin:gradle-avro-plugin | 0.11.0 | 0.15.1 | 2018-01-30 | 2018-09-17 | BEAM-5399
  net.ltgt.gradle:gradle-errorprone-plugin | 0.0.13 | 0.6 | 2018-06-08 | 2018-09-17 | BEAM-5400
  com.google.code.gson:gson | 2.7 | 2.8.5 | 2016-06-14 | 2018-05-22 | BEAM-4947
  com.google.guava:guava | 20.0 | 26.0-jre | 2016-10-28 | 2018-08-01 | BEAM-5085
  org.apache.hbase:hbase-common | 1.2.6 | 2.1.0 | 2017-05-29 | 2018-07-10 | BEAM-4951
  org.apache.hbase:hbase-hadoop-compat | 1.2.6 | 2.1.0 | 2017-05-29 | 2018-07-10 | BEAM-4952
  org.apache.hbase:hbase-hadoop2-compat | 1.2.6 | 2.1.0 | 2017-05-29 | 2018-07-10 | BEAM-4953
  org.apache.hbase:hbase-server | 1.2.6 | 2.1.0 | 2017-05-29 | 2018-07-10 | BEAM-4954
  org.apache.hbase:hbase-shaded-client | 1.2.6 | 2.1.0 | 2017-05-29 | 2018-07-10 | BEAM-4955
  org.apache.hbase:hbase-shaded-server | 1.2.6 | 2.0.0-alpha2 | 2017-05-29 | 2017-08-16 | BEAM-4956
  org.apache.hive:hive-cli | 2.1.0 | 3.1.0 | 2016-06-17 | 2018-07-23 | BEAM-5230
  org.apache.hive:hive-common | 2.1.0 | 3.1.0 | 2016-06-17 | 2018-07-23 | BEAM-5231
  org.apache.hive:hive-exec | 2.1.0 | 3.1.0 | 2016-06-17 | 2018-07-23 | BEAM-5232
  org.apache.hive.hcatalog:hive-hcatalog-core | 2.1.0 | 3.1.0 | 2016-06-17 | 2018-07-23 | BEAM-5233
  net.java.dev.javacc:javacc | 4.0 | 7.0.3 | 2006-03-17 | 2017-11-06 | BEAM-4969
  org.slf4j:jcl-over-slf4j | 1.7.25 | 1.8.0-beta2 | 2017-03-16 | 2018-03-21 | BEAM-5234
  net.java.dev.jna:jna | 4.1.0 | 4.5.2 | 2014-03-06 | 2018-07-12 | BEAM-4973
  com.esotericsoftware.kryo:kryo | 2.21 | 2.24.0 | 2013-02-27 | 2014-05-04 | BEAM-4975
  org.apache.kudu:kudu-client | 1.4.0 | 1.7.1 | 2017-06-05 | 2018-05-30 | BEAM-5087
  io.dropwizard.metrics:metrics-core | 3.1.2 | 4.1.0-rc2 | 2015-04-26 | 2018-05-03 | BEAM-4977
  org.mongodb:mongo-java-driver | 3.2.2 | 3.8.1 | 2016-02-15 | 2018-08-20 | BEAM-5235
  io.netty:netty-all | 4.1.17.Final | 5.0.0.Alpha2 | 2017-11-08 | 2015-03-03 | BEAM-4981
  io.opencensus:opencensus-contrib-grpc-metrics | 0.12.3 | 0.16.0 | 2018-04-13 | 2018-09-14 | BEAM-5401
  org.apache.qpid:proton-j | 0.13.1 | 0.29.0 | 2016-07-02 | 2018-08-10 | BEAM-5153

Re: SplittableDoFn

2018-09-17 Thread Ismaël Mejía
Thanks a lot Luke for bringing this back to the mailing list and Ryan for taking
the notes.

I would like to know if there was some discussion, or if you guys have given
some thought to the required changes in the SDK (API) part. What will be the
equivalent of `splitAtFraction`, and what should IO authors do to support it?

On Sat, Sep 15, 2018 at 1:52 AM Lukasz Cwik wrote:
>
> Thanks to everyone who joined and for the questions asked.
>
> Ryan graciously collected notes of the discussion: 
> https://docs.google.com/document/d/1kjJLGIiNAGvDiUCMEtQbw8tyOXESvwGeGZLL-0M06fQ/edit?usp=sharing
>
> The summary was that BoundedSource/UnboundedSource should be brought into a
> unified backlog-reporting mechanism, with optional additional signals that
> Dataflow has found useful (such as whether the remaining restriction is
> splittable: yes, no, unknown). Other runners can use these signals or not.
> SDFs should report backlog and watermark as the minimum bar. The backlog
> should use an arbitrary-precision type such as Java's BigDecimal, to prevent
> issues where limited precision removes the ability to compute deltas
> efficiently.
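
To make the precision point concrete, here is a minimal, hypothetical Java
sketch (not from the proposal doc; the class name is invented) contrasting
double with BigDecimal when computing the backlog delta between two progress
reports:

    import java.math.BigDecimal;

    public class BacklogDeltaSketch {
      public static void main(String[] args) {
        // With double, a small amount of completed work can vanish: at a
        // backlog around 1e18 the spacing between adjacent doubles is 128,
        // so subtracting 10 units of work rounds back to the same value.
        double before = 1e18;
        double after = 1e18 - 10;
        System.out.println(before - after); // prints 0.0, the delta is lost

        // With an arbitrary-precision type the delta survives intact.
        BigDecimal b = new BigDecimal("1000000000000000000");
        BigDecimal a = b.subtract(BigDecimal.TEN);
        System.out.println(b.subtract(a)); // prints 10
      }
    }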
>
>
>
> On Wed, Sep 12, 2018 at 3:54 PM Lukasz Cwik wrote:
>>
>> Here is the link to join the discussion: https://meet.google.com/idc-japs-hwf
>> Remember that it is this Friday Sept 14th from 11am-noon PST.
>>
>>
>>
>> On Mon, Sep 10, 2018 at 7:30 AM Maximilian Michels wrote:
>>>
>>> Thanks for moving forward with this, Lukasz!
>>>
>>> Unfortunately, I can't make it on Friday, but I'll sync with somebody on
>>> the call (e.g. Ryan) about your discussion.
>>>
>>> On 08.09.18 02:00, Lukasz Cwik wrote:
>>> > Thanks to everyone who filled out the doodle poll. The most
>>> > popular time was Friday Sept 14th from 11am-noon PST. I'll send out a
>>> > calendar invite and meeting link early next week.
>>> >
>>> > I have received a lot of feedback on the document and have addressed
>>> > some parts of it including:
>>> > * clarifying terminology
>>> > * processing skew due to some restrictions having their watermarks much
>>> > further behind than others, affecting scheduling of bundles by runners
>>> > * external throttling & I/O wait overhead reporting to make sure we
>>> > don't overscale
>>> >
>>> > Areas that still need additional feedback and details are:
>>> > * reporting progress around the work that is done and is active
>>> > * more examples
>>> > * unbounded restrictions being caused by an unbounded number of splits
>>> > of existing unbounded restrictions (infinite work growth)
>>> > * whether we should be reporting this information at the PTransform
>>> > level or at the bundle level
>>> >
>>> >
>>> >
>>> > On Wed, Sep 5, 2018 at 1:53 PM Lukasz Cwik wrote:
>>> >
>>> > Thanks to all those who have provided interest in this topic by the
>>> > questions they have asked on the doc already and for those
>>> > interested in having this discussion. I have setup this doodle to
>>> > allow people to provide their availability:
>>> > https://doodle.com/poll/nrw7w84255xnfwqy
>>> >
>>> > I'll send out the chosen time based upon peoples availability and a
>>> > Hangout link by end of day Friday so please mark your availability
>>> > using the link above.
>>> >
>>> > The agenda of the meeting will be as follows:
>>> > * Overview of the proposal
>>> > * Enumerate and discuss/answer questions brought up in the meeting
>>> >
>>> > Note that all questions and any discussions/answers provided will be
>>> > added to the doc for those who are unable to attend.
>>> >
>>> > On Fri, Aug 31, 2018 at 9:47 AM Jean-Baptiste Onofré wrote:
>>> >
>>> > +1
>>> >
>>> > Regards
>>> > JB
>>> > On 31 August 2018 at 18:22, Lukasz Cwik wrote:
>>> >
>>> > That is possible, I'll take people's date/time suggestions
>>> > and create a simple online poll with them.
>>> >
>>> > On Fri, Aug 31, 2018 at 2:22 AM Robert Bradshaw wrote:
>>> >
>>> > Thanks for taking this up. I added some comments to the
>>> > doc. A European-friendly time for discussion would
>>> > be great.
>>> >
>>> > On Fri, Aug 31, 2018 at 3:14 AM Lukasz Cwik wrote:
>>> >
>>> > I came up with a proposal[1] for a progress model based
>>> > solely on the backlog, where splits are expressed as the
>>> > remaining backlog we want the SDK to split at. I also give recommendations to
>>> > runner authors as to how an autoscaling system could
>>> > work based upon the measured backlog. A lot of
>>> > discussions around progress reporting and splitting

Re: [Discuss] Upgrade story for Beam's execution engines

2018-09-17 Thread Robert Bradshaw
On Mon, Sep 17, 2018 at 2:02 AM Austin Bennett wrote:

> Do we currently maintain a finer-grained list of compatibility between
> execution/runner versions and Beam versions? Is this only really a concern
> with recent Flink (it sounded like at least the Spark jump, too)? I see the
> capability matrix:
> https://beam.apache.org/documentation/runners/capability-matrix/, but some
> sort of compatibility table between runner versions and Beam releases might
> be useful.
>
> I see a compatibility matrix for Beam features, but not for the underlying
> runners. E.g., something like this would save a user from trying to get Beam
> working on recent Flink 1.6 and then subsequently hitting a (potentially
> not well documented) wall given known issues.
>

+1. I was bitten by this as well.

I don't know if it's worth having a compatibility matrix for each version
(as the overlap is likely to be all or nothing in most cases), but it
should be prominently displayed here and elsewhere. Want to send out a PR?


Re: [Discuss] Upgrade story for Beam's execution engines

2018-09-17 Thread Robert Bradshaw
On Sun, Sep 16, 2018 at 12:59 PM Maximilian Michels wrote:

> > If I understand the LTS proposal correctly, then it will be a release
> line that continues to receive patches (as in semantic versioning), but no
> new features as that would defeat the purpose (stability).
>
> It matters insofar as execution engine upgrades could be performed on
> master, but the LTS version won't receive them. So LTS is the go-to
> if you want to ensure compatibility with your existing setup.
>

To clarify, there are two classes of OldRunner users here: those that want
new features in Beam, and those that simply want to run a supported version
of Beam. The LTS proposal helps the latter, which is going to be biased
towards those not upgrading their runners. The former are worth supporting
as well.


> > To limit the pain of dealing with incompatible runner changes and copies
> within Beam, we should probably also work with the respective community to
> improve the compatibility story.
>
> Absolutely. If we find that we can improve compatibility with upstream
> changes, we should go that path. Even if we don't have a dedicated
> compatibility layer upstream yet.
>
> On 13.09.18 19:34, Thomas Weise wrote:
> >
> > On Thu, Sep 13, 2018 at 9:49 AM Maximilian Michels wrote:
> >
> > Thank you for your comments. Let me try to summarize what has been
> > discussed so far:
> >
> > 1. The Beam LTS version will ensure a stable execution engine for as
> > long as the LTS life span.
> >
> >
> > If I understand the LTS proposal correctly, then it will be a release
> > line that continues to receive patches (as in semantic versioning), but
> > no new features as that would defeat the purpose (stability).
> >
> > If so, then I don't think LTS matters for this discussion.
> >
> > 2. We agree that pushing updates to the execution engine for the
> > Runners is only desirable if it results in a better integration with the
> > Beam model or if it is necessary due to security or performance reasons.
> >
> > 3. We might have to consider adding additional build targets for a
> > Runner whenever the execution engine gets upgraded. This might be
> > really easy if the engine's API remains stable. It might also be
> > desirable if the upgrade path is not easy and not completely
> > foreseeable, e.g. Etienne mentioned Spark 1.x vs Spark 2.x Runner.
> The
> > Beam feature set could vary depending on the version.
> >
> >
> > To limit the pain of dealing with incompatible runner changes and copies
> > within Beam, we should probably also work with the respective community
> > to improve the compatibility story.
> >
> >
> > 4. In the long run, we want a stable abstraction layer for each
> Runner
> > that, ideally, is maintained by the upstream of the execution
> > engine. In
> > the short run, this is probably not realistic, as the shared
> libraries
> > of Beam are not stable enough.
> >
> >
> > Yes, that will only become an option once we reach interface stability.
> > Similar to how the runner projects maintain their IO connectors.
> >
> > On 13.09.18 14:39, Robert Bradshaw wrote:
> >  > The ideal long-term solution is, as Romain mentions, pushing the
> >  > runner-specific code up to be maintained by each runner with a
> > stable
> >  > API to use to talk to Beam. Unfortunately, I think we're still a
> > long
> >  > way from having this Stable API, or having the clout for
> >  > non-beam-developers to maintain these bindings externally (though
> >  > hopefully we'll get there).
> >  >
> >  > In the short term, we're stuck with either hurting users that
> > want to
> >  > stick with Flink 1.5, hurting users that want to upgrade to Flink
> > 1.6,
> >  > or supporting both. Is Beam's interaction with Flink such that we
> > can't
> >  > simply have separate targets linking the same Beam code against
> > one or
> >  > the other? (I.e. are code changes needed?) If so, we'll probably
> > need a
> >  > flink-runner-1.5 module, a flink-runner-1.6, and a
> > flink-runner-common
> >  > module. Or we hope that all users are happy with 1.5 until a
> certain
> >  > point in time when they all want to simultaneously jump to 1.6
> > and Beam
> >  > at the same time. Maybe that's enough in the short term, but
> > longer term
> >  > we need a more sustainable solution.
> >  >
> >  >
> >  > On Thu, Sep 13, 2018 at 7:13 AM Romain Manni-Bucau wrote:
> >  >
> >  > Hi guys,
> >  >
> >  > Isnt the issue "only" that beam has this code instead of
> engines?
> >  >
> >  > Assuming beam runner facing api is stable - which must be the
> > case
> >  > anyway - and that each engine has its 

Re: SparkRunner - GroupByKey

2018-09-17 Thread David Morávek
Thanks! I've created BEAM-5392 to track the issue.

On Fri, Sep 14, 2018 at 4:46 PM Robert Bradshaw wrote:

> On Fri, Sep 14, 2018 at 4:22 PM David Morávek wrote:
>
>> Hello Robert,
>>
>> thanks for the answer! Spark allows us to sort a single partition
>> (after repartitioning) by a user-provided comparator, so it is definitely
>> possible to do a secondary sort by timestamp. The "more intelligent
>> ReduceFnRunner" you are talking about, is it part of the Beam codebase
>> already (I guess it would lower the contribution investment if we'd try
>> to fix this)?
>>
>
> No, but it is part of the dataflow worker codebase we're trying to donate
> (being discussed on the other thread on this very list today), so hopefully
> soon.
>
> This would definitely work for our use case (we used the exact same approach
>> in our custom SparkRunner).
>>
>> Although, there is one thing to consider. This approach would solve the
>> scaling issue, but it would probably be less effective at smaller
>> scale. I think this could be solved by supporting multiple
>> translators for a single operator and letting the user "hint" to the
>> translation layer which one to use. What do you think?
>>
>
> If at all possible, I generally prefer to avoid providing hints like this
> that the user needs to use to get decent performance as it simply doesn't
> scale (in many directions). Fortunately in this case, though you have to be
> a bit more careful about things, it is not less efficient.
>
>> On Fri, Sep 14, 2018 at 4:10 PM Robert Bradshaw wrote:
>>
>>>
>>> If Spark supports producing grouped elements in timestamp order, a more
>>> intelligent ReduceFnRunner can be used. (We take advantage of that in
>>> Dataflow for example.)
>>>
>>> For non-merging windows, you could also put the window itself (or some
>>> subset thereof) into the key, resulting in smaller groupings. I'm not sure I
>>> understand your output requirements enough to know if this would work.
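
A minimal sketch of that idea in Beam's Java SDK, assuming non-merging fixed
or sliding windows (IntervalWindow); the helper class and names are
illustrative, not from this thread:

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.DoFn.ProcessElement;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
    import org.apache.beam.sdk.transforms.windowing.IntervalWindow;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    public class RekeyByWindow {
      // Rekey each element by (key, window) so that a subsequent GroupByKey
      // produces one group per key *and* window, instead of one large group
      // per key spanning all of its windows.
      public static PCollection<KV<KV<String, IntervalWindow>, String>> rekey(
          PCollection<KV<String, String>> input) {
        return input.apply("RekeyByWindow",
            ParDo.of(new DoFn<KV<String, String>, KV<KV<String, IntervalWindow>, String>>() {
              @ProcessElement
              public void process(ProcessContext c, BoundedWindow w) {
                // The cast is safe for fixed/sliding windows; merging windows
                // (sessions) would need a different approach.
                IntervalWindow window = (IntervalWindow) w;
                c.output(KV.of(KV.of(c.element().getKey(), window),
                    c.element().getValue()));
              }
            }));
      }
      // Note: depending on the SDK version, the composite key may need an
      // explicit coder, e.g. KvCoder.of(KvCoder.of(StringUtf8Coder.of(),
      // IntervalWindow.getCoder()), StringUtf8Coder.of()).
    }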
>>>
>>> On Fri, Sep 14, 2018 at 3:28 PM David Morávek wrote:
>>>
 Hello,

 currently, we are trying to move one of our large-scale batch jobs
 (~100TB inputs) from our Euphoria-based SparkRunner to Beam's Spark runner,
 and we came across the following issue.

 Because we rely on the Hadoop ecosystem, we need to group outputs by
 TaskAttemptID in order to use OutputFormats based on FileOutputFormat.

 We do this by using *GroupByKey*, but we came across the known
 problem that all values for any single key need to fit in memory at once.

 I did some quick research, and I think the following needs to be addressed:
 a) We cannot use Spark's *groupByKey*, because it requires all values
 to fit in memory for a single key (it is implemented as a "list combiner")
 b) *ReduceFnRunner* iterates over the values multiple times in order to
 also group by window

>>>
 In the Euphoria-based runner, we solved this for *non-merging* windowing
 by using Spark's *repartitionAndSortWithinPartitions*, where we sorted the
 output by key and window so it could be processed sequentially.

 Did anyone run into the same issue? Is there currently any workaround
 for this? How should we approach this?

 Thanks,
 David
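
For reference, here is a minimal, hypothetical sketch in Spark's Java API
(all names invented) of the repartitionAndSortWithinPartitions approach
described above: partition on the real key only, then sort each partition by
(key, timestamp), so a downstream mapPartitions can walk the data once,
starting a new group whenever the key changes, with memory use independent of
the number of values per key.

    import java.io.Serializable;
    import java.util.Comparator;
    import org.apache.spark.Partitioner;
    import org.apache.spark.api.java.JavaPairRDD;
    import scala.Tuple2;

    public class SecondarySortSketch {
      // Composite key: (userKey, timestamp). Partition on userKey only so
      // all timestamps of a key land in the same partition.
      static class ByUserKey extends Partitioner {
        private final int numPartitions;
        ByUserKey(int numPartitions) { this.numPartitions = numPartitions; }
        @Override public int numPartitions() { return numPartitions; }
        @Override public int getPartition(Object compositeKey) {
          @SuppressWarnings("unchecked")
          Tuple2<String, Long> k = (Tuple2<String, Long>) compositeKey;
          return Math.floorMod(k._1().hashCode(), numPartitions);
        }
      }

      // Order within each partition by (userKey, timestamp) so values arrive
      // grouped by key and sorted in time.
      static class KeyThenTimestamp
          implements Comparator<Tuple2<String, Long>>, Serializable {
        @Override
        public int compare(Tuple2<String, Long> a, Tuple2<String, Long> b) {
          int byKey = a._1().compareTo(b._1());
          return byKey != 0 ? byKey : Long.compare(a._2(), b._2());
        }
      }

      static JavaPairRDD<Tuple2<String, Long>, byte[]> secondarySort(
          JavaPairRDD<Tuple2<String, Long>, byte[]> input, int numPartitions) {
        return input.repartitionAndSortWithinPartitions(
            new ByUserKey(numPartitions), new KeyThenTimestamp());
      }
    }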