Re: [PROPOSAL] Prepare Beam 2.9.0 release

2018-11-15 Thread Charles Chen
+1

Note that we need to temporarily revert
https://github.com/apache/beam/pull/6683 before the release branch cut per
the discussion at
https://lists.apache.org/thread.html/78fe33dc41b04886f5355d66d50359265bfa2985580bb70f79c53545@%3Cdev.beam.apache.org%3E

On Thu, Nov 15, 2018 at 9:18 PM Tim  wrote:

> Thanks Cham
> +1
>
> On 16 Nov 2018, at 05:30, Thomas Weise  wrote:
>
> +1
>
>
> On Thu, Nov 15, 2018 at 4:34 PM Ahmet Altay  wrote:
>
>> +1 Thank you.
>>
>> On Thu, Nov 15, 2018 at 4:22 PM, Kenneth Knowles  wrote:
>>
>>> SGTM. Thanks for keeping track of the schedule.
>>>
>>> Kenn
>>>
>>> On Thu, Nov 15, 2018 at 1:59 PM Chamikara Jayalath 
>>> wrote:
>>>
 Hi All,

 According to the release calendar [1] branch cut date for Beam 2.9.0
 release is 11/21/2018. Since previous release branch was cut close to the
 respective calendar date I'd like to propose cutting release branch for
 2.9.0 on 11/21/2018. Next week is Thanksgiving holiday in US and possibly
 some folks will be out so we can try to produce RC1 on Monday after
 (11/26/2018). We can attend to current blocker JIRAs [2] in the meantime.

 I'd like to volunteer to perform this release.

 WDYT ?

 Thanks,
 Cham

 [1]
 https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com&ctz=America%2FLos_Angeles
 [2] https://s.apache.org/beam-2.9.0-burndown


>>


A new Beam Runner on Apache Nemo

2018-11-15 Thread 송원욱
Hello all!

I'm a member of the Apache Nemo community, another Apache project for
processing big data that focuses on easy-to-use, flexible optimizations for
various deployment environments. More information can be seen on our website.
We've been building the system for quite a while now, and we have been using
Apache Beam as one of the programming layers that we support for writing data
processing applications. We have already taken a look at the capability matrix
of Beam runners and the runner authoring guide, and we have been successful
in implementing a large portion of the capability criteria.

With this progress, we wish to list our runner as one of the Beam runners, to
let users know that our system supports Beam and that Beam users have another
option to choose from for running their data processing applications. It
would be lovely to know the details of the process required for this!

Thanks!
Wonook


Re: [PROPOSAL] Prepare Beam 2.9.0 release

2018-11-15 Thread Tim
Thanks Cham
+1

> On 16 Nov 2018, at 05:30, Thomas Weise  wrote:
> 
> +1
> 
> 
>> On Thu, Nov 15, 2018 at 4:34 PM Ahmet Altay  wrote:
>> +1 Thank you.
>> 
>>> On Thu, Nov 15, 2018 at 4:22 PM, Kenneth Knowles  wrote:
>>> SGTM. Thanks for keeping track of the schedule.
>>> 
>>> Kenn
>>> 
 On Thu, Nov 15, 2018 at 1:59 PM Chamikara Jayalath  
 wrote:
 Hi All,
 
 According to the release calendar [1] branch cut date for Beam 2.9.0 
 release is 11/21/2018. Since previous release branch was cut close to the 
 respective calendar date I'd like to propose cutting release branch for 
 2.9.0 on 11/21/2018. Next week is Thanksgiving holiday in US and possibly 
 some folks will be out so we can try to produce RC1 on Monday after 
 (11/26/2018). We can attend to current blocker JIRAs [2] in the meantime. 
 
 I'd like to volunteer to perform this release.
 
 WDYT ?
 
 Thanks,
 Cham
 
 [1] 
 https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com&ctz=America%2FLos_Angeles
 [2] https://s.apache.org/beam-2.9.0-burndown
 
>> 


Re: [PROPOSAL] Prepare Beam 2.9.0 release

2018-11-15 Thread Thomas Weise
+1


On Thu, Nov 15, 2018 at 4:34 PM Ahmet Altay  wrote:

> +1 Thank you.
>
> On Thu, Nov 15, 2018 at 4:22 PM, Kenneth Knowles  wrote:
>
>> SGTM. Thanks for keeping track of the schedule.
>>
>> Kenn
>>
>> On Thu, Nov 15, 2018 at 1:59 PM Chamikara Jayalath 
>> wrote:
>>
>>> Hi All,
>>>
>>> According to the release calendar [1] branch cut date for Beam 2.9.0
>>> release is 11/21/2018. Since previous release branch was cut close to the
>>> respective calendar date I'd like to propose cutting release branch for
>>> 2.9.0 on 11/21/2018. Next week is Thanksgiving holiday in US and possibly
>>> some folks will be out so we can try to produce RC1 on Monday after
>>> (11/26/2018). We can attend to current blocker JIRAs [2] in the meantime.
>>>
>>> I'd like to volunteer to perform this release.
>>>
>>> WDYT ?
>>>
>>> Thanks,
>>> Cham
>>>
>>> [1]
>>> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com&ctz=America%2FLos_Angeles
>>> [2] https://s.apache.org/beam-2.9.0-burndown
>>>
>>>
>


Re: [VOTE] Release Vendored gRPC 1.13.1 and Guava 20.0, release candidate #1

2018-11-15 Thread Thomas Weise
Thanks for driving this. Did we reach a conclusion regarding publishing
relocated source artifacts? Debugging would be painful without them (unless
they are manually installed in the local repo).


On Thu, Nov 15, 2018 at 6:05 PM Lukasz Cwik  wrote:

> Please review and vote on the release candidate #1 for the vendored
> artifacts gRPC 1.13.1 and Guava 20.0:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
> The creation of these artifacts is the outcome of the discussion about
> vendoring[1].
>
> The complete staging area is available for your review, which includes:
> * all artifacts to be deployed to the Maven Central Repository [2],
> * commit hash "3678d403fcfea6a3994d7b86cfe6db70039087b0" [3],
> * Java artifacts were built with Gradle 4.10.2 and OpenJDK 1.8.0_161
> * artifacts which are signed with the key with fingerprint
> EAD5DE293F4A03DD2E77565589E68A56E371CCA2 [4]
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
> Thanks,
> Luke
>
> [1]
> https://lists.apache.org/thread.html/4c12db35b40a6d56e170cd6fc8bb0ac4c43a99aa3cb7dbae54176815@%3Cdev.beam.apache.org%3E
> [2] https://repository.apache.org/content/repositories/orgapachebeam-1052
> [3]
> https://github.com/apache/beam/tree/3678d403fcfea6a3994d7b86cfe6db70039087b0
> [4] https://dist.apache.org/repos/dist/release/beam/KEYS
>
>


Re: [VOTE] Mark 2.7.0 branch as a long term support (LTS) branch

2018-11-15 Thread Ahmet Altay
Thank you all for voting. This vote was open for 7 days, so let's wrap it up.
There were 8 +1, 1 +0, and no -1 votes.

- I added a new version, 2.7.1, that can be used for tracking whatever we
would like to consider for backporting (
https://issues.apache.org/jira/projects/BEAM/versions/12344458).
- I will update our website to note that 2.7.0 is an LTS release that will be
supported until 5/16/2019.

Ahmet

On Tue, Nov 13, 2018 at 5:43 AM, Jean-Baptiste Onofré 
wrote:

> +0
>
> Regards
> JB
>
> On 09/11/2018 02:47, Ahmet Altay wrote:
> > Hi all,
> >
> > Please review the following statement:
> >
> > "2.7.0 branch will be marked as the long-term-support (LTS) release
> > branch. This branch will be supported for a window of 6 months starting
> > from the day it is marked as an LTS branch. Beam community will decide
> > on which issues will be backported and when patch releases on the branch
> > will be made on a case by case basis."
> >
> > [ ] +1, Approve
> > [ ] -1, Do not approve (please provide specific comments)
> >
> > Context:
> > - Discussion on the dev@ list [1].
> > - Beam's release policies [2].
> >
> > Thank you,
> > Ahmet
> >
> > [1] https://lists.apache.org/thread.html/f604b2d688ad467ddccd4cf56b664b
> 7309dae78f1bd8849e4bb9aae2@%3Cdev.beam.apache.org%3E
> > [2] https://beam.apache.org/community/policies/
> >
> >
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [DISCUSS] Publish vendored dependencies independently

2018-11-15 Thread Lukasz Cwik
I have created the artifacts and sent out a vote thread.

On Thu, Nov 15, 2018 at 5:23 PM Lukasz Cwik  wrote:

> On Thu, Nov 15, 2018 at 11:47 AM Thomas Weise  wrote:
>
>> In case any of this affects how artifacts are published in general,
>> please make sure that publishing to 3rd party repo continues to work.
>>
>> For example: ./gradlew :beam-runners-flink_2.11-job-server:publish
>> -PisRelease -PnoSigning -PdistMgmtSnapshotsUrl=
>> https://mycustomrepo/libs-release
>>
>> Yes, I still kept this around since I used the code that we currently use
> for publishing.
>
>
>> Thanks,
>> Thomas
>>
>>
>> On Thu, Nov 15, 2018 at 11:27 AM Kenneth Knowles  wrote:
>>
>>> Agree on the low bar. We should just make them all 0.x releases to send
>>> the right message (don't use, and no compatibility) and not worry as much
>>> about bad releases, which we
>>> would never actually depend on in the project.
>>>
>>> QQ: What does the new -P flag do? I was also hoping to eliminate the
>>> redundant -PisRelease flag, especially for vendored deps that should really
>>> be straight line.
>>>
>>
> I found having the -PisRelease flag useful for local testing because I
> could publish -SNAPSHOT builds but it isn't strictly necessary.
> The -PvendoredDependenciesOnly enables publishing of vendored dependencies
> so they aren't part of the regular release process. The name could be
> changed to be something more appropriate.
>
>
>>
>>> Kenn
>>>
>>> On Wed, Nov 14, 2018 at 12:38 PM Lukasz Cwik  wrote:
>>>
 It's a small hassle but could be checked in with some changes, my
 example commit was so that people could try it out as is.

 I'll work towards getting it checked in and then start a release for
 gRPC and guava.

 On Wed, Nov 14, 2018 at 11:45 AM Scott Wegner  wrote:

> Thanks for pushing this forward Luke.
>
> My understanding is that these vendored grpc artifacts will only be
> consumed directly by Beam internal components (as opposed to Beam user
> projects). So there should be a fairly low bar for publishing them. But
> perhaps we should have some short checklist for releasing them for
> consistency.
>
> One item I would suggest for such a checklist would be to publish
> artifacts from checked-in apache/beam sources and then tag the release
> commit. Is it possible to get your changes merged in first, or is there a
> chicken-and-egg problem that artifacts need to be published and available
> for consumption?
>
> On Wed, Nov 14, 2018 at 10:51 AM Lukasz Cwik  wrote:
>
>> Note, I could also release the vendored version of guava 20 in
>> preparation for us to start consuming it. Any concerns?
>>
>> On Tue, Nov 13, 2018 at 3:59 PM Lukasz Cwik  wrote:
>>
>>> I have made some incremental progress on this and wanted to release
>>> our first vendored dependency of gRPC 1.13.1 since I was able to fix a 
>>> good
>>> number of the import/code completion errors that Intellij was 
>>> experiencing.
>>> I have published an example of what the jar/pom looks like in the Apache
>>> Staging repo:
>>>
>>> https://repository.apache.org/content/groups/snapshots/org/apache/beam/beam-vendor-grpc-1_13_1/
>>>
>>> You can also checkout[1] and from a clean workspace run:
>>> ./gradlew :beam-vendor-grpc-1_13_1:publishToMavenLocal -PisRelease
>>> -PvendoredDependenciesOnly
>>> which will build a vendored version of gRPC that is published to
>>> your local maven repository. All the projects that depended on the 
>>> gradle
>>> beam-vendor-grpc-1_13_1 project are now pointing at the Maven artifact
>>> org.apache.beam:beam-vendor-grpc-1_13_1:0.1
>>>
>>> I was planning to follow the Apache Beam release process but only
>>> for this specific artifact and start a vote thread if there aren't any
>>> concerns.
>>>
>>> 1:
>>> https://github.com/lukecwik/incubator-beam/commit/4b1b7b40ef316559f81c42dfdd44da988db201e9
>>>
>>>
>>> On Thu, Oct 25, 2018 at 10:59 AM Lukasz Cwik 
>>> wrote:
>>>
 That's a good point, Thomas; I hadn't considered the lib/ case. I also
 am recommending what Thomas is suggesting as well.

 On Thu, Oct 25, 2018 at 10:52 AM Maximilian Michels 
 wrote:

> On 25.10.18 19:23, Lukasz Cwik wrote:
> >
> >
> > On Thu, Oct 25, 2018 at 9:59 AM Maximilian Michels <
> m...@apache.org
> > > wrote:
> >
> > Question: How would a user end up with the same shaded
> dependency
> > twice?
> > The shaded dependencies are transitive dependencies of Beam
> and thus,
> > this shouldn't happen. Is this a safe-guard when running
> different
> > versions of Beam in the same JVM?
> >
> >
> > What I was referring 

[VOTE] Release Vendored gRPC 1.13.1 and Guava 20.0, release candidate #1

2018-11-15 Thread Lukasz Cwik
Please review and vote on the release candidate #1 for the vendored
artifacts gRPC 1.13.1 and Guava 20.0:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

The creation of these artifacts is the outcome of the discussion about
vendoring[1].

The complete staging area is available for your review, which includes:
* all artifacts to be deployed to the Maven Central Repository [2],
* commit hash "3678d403fcfea6a3994d7b86cfe6db70039087b0" [3],
* Java artifacts were built with Gradle 4.10.2 and OpenJDK 1.8.0_161
* artifacts which are signed with the key with fingerprint
EAD5DE293F4A03DD2E77565589E68A56E371CCA2 [4]

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.

Thanks,
Luke

[1]
https://lists.apache.org/thread.html/4c12db35b40a6d56e170cd6fc8bb0ac4c43a99aa3cb7dbae54176815@%3Cdev.beam.apache.org%3E
[2] https://repository.apache.org/content/repositories/orgapachebeam-1052
[3]
https://github.com/apache/beam/tree/3678d403fcfea6a3994d7b86cfe6db70039087b0
[4] https://dist.apache.org/repos/dist/release/beam/KEYS


Re: [DISCUSS] Publish vendored dependencies independently

2018-11-15 Thread Lukasz Cwik
On Thu, Nov 15, 2018 at 11:47 AM Thomas Weise  wrote:

> In case any of this affects how artifacts are published in general, please
> make sure that publishing to 3rd party repo continues to work.
>
> For example: ./gradlew :beam-runners-flink_2.11-job-server:publish
> -PisRelease -PnoSigning -PdistMgmtSnapshotsUrl=
> https://mycustomrepo/libs-release
>
> Yes, I still kept this around since I used the code that we currently use
for publishing.


> Thanks,
> Thomas
>
>
> On Thu, Nov 15, 2018 at 11:27 AM Kenneth Knowles  wrote:
>
>> Agree on the low bar. We should just make them all 0.x releases to send
>> the right message (don't use, and no compatibility) and not worry as much
>> about bad releases, which we
>> would never actually depend on in the project.
>>
>> QQ: What does the new -P flag do? I was also hoping to eliminate the
>> redundant -PisRelease flag, especially for vendored deps that should really
>> be straight line.
>>
>
I found having the -PisRelease flag useful for local testing because I
could publish -SNAPSHOT builds but it isn't strictly necessary.
The -PvendoredDependenciesOnly enables publishing of vendored dependencies
so they aren't part of the regular release process. The name could be
changed to be something more appropriate.
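
For context, consuming one of the published vendored artifacts from
Beam-internal code would look roughly like the sketch below. The relocated
package prefix is only an assumption for illustration; the actual prefix is
whatever the shading configuration of the vendored build defines.

// Hedged sketch: "org.apache.beam.vendor.guava.v20" is an assumed relocation
// prefix, not a confirmed one; the vendored build's shading config is the
// source of truth. The Guava 20 API itself is unchanged by relocation.
import org.apache.beam.vendor.guava.v20.com.google.common.collect.ImmutableList;

public class VendoredGuavaExample {
  public static void main(String[] args) {
    // Because the classes live under a relocated package, this copy of Guava
    // cannot clash with whatever Guava version a user's pipeline brings in.
    ImmutableList<String> parts = ImmutableList.of("vendored", "guava", "20.0");
    System.out.println(parts);
  }
}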


>
>> Kenn
>>
>> On Wed, Nov 14, 2018 at 12:38 PM Lukasz Cwik  wrote:
>>
>>> It's a small hassle but could be checked in with some changes, my example
>>> commit was so that people could try it out as is.
>>>
>>> I'll work towards getting it checked in and then start a release for
>>> gRPC and guava.
>>>
>>> On Wed, Nov 14, 2018 at 11:45 AM Scott Wegner  wrote:
>>>
 Thanks for pushing this forward Luke.

 My understanding is that these vendored grpc artifacts will only be
 consumed directly by Beam internal components (as opposed to Beam user
 projects). So there should be a fairly low bar for publishing them. But
 perhaps we should have some short checklist for releasing them for
 consistency.

 One item I would suggest for such a checklist would be to publish
 artifacts from checked-in apache/beam sources and then tag the release
 commit. Is it possible to get your changes merged in first, or is there a
 chicken-and-egg problem that artifacts need to be published and available
 for consumption?

 On Wed, Nov 14, 2018 at 10:51 AM Lukasz Cwik  wrote:

> Note, I could also release the vendored version of guava 20 in
> preparation for us to start consuming it. Any concerns?
>
> On Tue, Nov 13, 2018 at 3:59 PM Lukasz Cwik  wrote:
>
>> I have made some incremental progress on this and wanted to release
>> our first vendored dependency of gRPC 1.13.1 since I was able to fix a 
>> good
>> number of the import/code completion errors that Intellij was 
>> experiencing.
>> I have published an example of what the jar/pom looks like in the Apache
>> Staging repo:
>>
>> https://repository.apache.org/content/groups/snapshots/org/apache/beam/beam-vendor-grpc-1_13_1/
>>
>> You can also checkout[1] and from a clean workspace run:
>> ./gradlew :beam-vendor-grpc-1_13_1:publishToMavenLocal -PisRelease
>> -PvendoredDependenciesOnly
>> which will build a vendored version of gRPC that is published to your
>> local maven repository. All the projects that depended on the gradle
>> beam-vendor-grpc-1_13_1 project are now pointing at the Maven artifact
>> org.apache.beam:beam-vendor-grpc-1_13_1:0.1
>>
>> I was planning to follow the Apache Beam release process but only for
>> this specific artifact and start a vote thread if there aren't any 
>> concerns.
>>
>> 1:
>> https://github.com/lukecwik/incubator-beam/commit/4b1b7b40ef316559f81c42dfdd44da988db201e9
>>
>>
>> On Thu, Oct 25, 2018 at 10:59 AM Lukasz Cwik 
>> wrote:
>>
>>> That's a good point, Thomas; I hadn't considered the lib/ case. I also
>>> am recommending what Thomas is suggesting as well.
>>>
>>> On Thu, Oct 25, 2018 at 10:52 AM Maximilian Michels 
>>> wrote:
>>>
 On 25.10.18 19:23, Lukasz Cwik wrote:
 >
 >
 > On Thu, Oct 25, 2018 at 9:59 AM Maximilian Michels <
 m...@apache.org
 > > wrote:
 >
 > Question: How would a user end up with the same shaded
 dependency
 > twice?
 > The shaded dependencies are transitive dependencies of Beam
 and thus,
 > this shouldn't happen. Is this a safe-guard when running
 different
 > versions of Beam in the same JVM?
 >
 >
 > What I was referring to was that they aren't exactly the same
 dependency
 > but slightly different versions of the same dependency. Since we
 are
 > planning to vendor each dependency and its transitive
 dependencies as
 

Re: [PROPOSAL] Prepare Beam 2.9.0 release

2018-11-15 Thread Ahmet Altay
+1 Thank you.

On Thu, Nov 15, 2018 at 4:22 PM, Kenneth Knowles  wrote:

> SGTM. Thanks for keeping track of the schedule.
>
> Kenn
>
> On Thu, Nov 15, 2018 at 1:59 PM Chamikara Jayalath 
> wrote:
>
>> Hi All,
>>
>> According to the release calendar [1] branch cut date for Beam 2.9.0
>> release is 11/21/2018. Since previous release branch was cut close to the
>> respective calendar date I'd like to propose cutting release branch for
>> 2.9.0 on 11/21/2018. Next week is Thanksgiving holiday in US and possibly
>> some folks will be out so we can try to produce RC1 on Monday after
>> (11/26/2018). We can attend to current blocker JIRAs [2] in the meantime.
>>
>> I'd like to volunteer to perform this release.
>>
>> WDYT ?
>>
>> Thanks,
>> Cham
>>
>> [1] https://calendar.google.com/calendar/embed?src=
>> 0p73sl034k80oob7seouanigd0%40group.calendar.google.com&
>> ctz=America%2FLos_Angeles
>> [2] https://s.apache.org/beam-2.9.0-burndown
>>
>>


Re: [PROPOSAL] Prepare Beam 2.9.0 release

2018-11-15 Thread Kenneth Knowles
SGTM. Thanks for keeping track of the schedule.

Kenn

On Thu, Nov 15, 2018 at 1:59 PM Chamikara Jayalath 
wrote:

> Hi All,
>
> According to the release calendar [1] branch cut date for Beam 2.9.0
> release is 11/21/2018. Since previous release branch was cut close to the
> respective calendar date I'd like to propose cutting release branch for
> 2.9.0 on 11/21/2018. Next week is Thanksgiving holiday in US and possibly
> some folks will be out so we can try to produce RC1 on Monday after
> (11/26/2018). We can attend to current blocker JIRAs [2] in the meantime.
>
> I'd like to volunteer to perform this release.
>
> WDYT ?
>
> Thanks,
> Cham
>
> [1]
> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com&ctz=America%2FLos_Angeles
> [2] https://s.apache.org/beam-2.9.0-burndown
>
>


Re: Migrating Beam SQL to Calcite's code generation

2018-11-15 Thread Andrew Pilloud
I've spent most of the day getting Beam working on Calcite 1.18 and now I'm
just starting to work on making your Joda UDF changes compatible with
Calcite's code generation, so I haven't gotten very far into investigating
how we might handle this yet.

Calcite has multiple aggregate function formats, including ones that have
access to the window, but it assumes that scalar project functions never need
to get at the window.  I'm still struggling to understand what the right
semantics are here. I'm not convinced this belongs in the scalar UDF path.
Might this really be something that is an extension on join?

As for making immediate progress: when we switch to Calcite's code
generation, it looks like we can implement our own generator for windowed
UDFs. We could put the window into Calcite's DataContext, and then pull it
out in the function. I'm a little hesitant to do it this way because it
will force us to run the Calc operator (all the scalar operations) in a
windowed DoFn. I think we will need to push this distinction into the rel
layer (as a WindowedCalc operation) so the planner can take it into
consideration. I'm going to keep moving forward on this tomorrow, and I
hope to have a proof of concept showing we can generate our own UDF calls.
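
As a rough illustration of the DataContext idea (the key name and wiring
below are assumptions for discussion, not existing Beam SQL code), the
element's window could be stashed in the DataContext handed to the generated
code and pulled back out at the call site:

import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.calcite.DataContext;
import org.apache.calcite.adapter.java.JavaTypeFactory;
import org.apache.calcite.linq4j.QueryProvider;
import org.apache.calcite.schema.SchemaPlus;

/** Sketch only: carries the current element's window under an agreed-upon key. */
class WindowedDataContext implements DataContext {
  static final String WINDOW_KEY = "beam.window"; // illustrative key name

  private final BoundedWindow window;

  WindowedDataContext(BoundedWindow window) {
    this.window = window;
  }

  @Override
  public Object get(String name) {
    // Generated code (or a windowed UDF wrapper) would call get(WINDOW_KEY)
    // to retrieve the window for the element currently being processed.
    return WINDOW_KEY.equals(name) ? window : null;
  }

  // The remaining DataContext methods are not needed for this sketch.
  @Override public SchemaPlus getRootSchema() { return null; }
  @Override public JavaTypeFactory getTypeFactory() { return null; }
  @Override public QueryProvider getQueryProvider() { return null; }
}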

Andrew

On Thu, Nov 15, 2018 at 12:11 PM Rui Wang  wrote:

> For the not working SESSION_END(), I had an investigation on it:
>
> https://issues.apache.org/jira/browse/BEAM-5799
> https://issues.apache.org/jira/browse/CALCITE-2645
>
> According to the reply in the Calcite JIRA, there might be some other way
> to implement SESSION_END. I haven't looked into it though.
>
> -Rui
>
> On Thu, Nov 15, 2018 at 11:56 AM Mingmin Xu  wrote:
>
>> 1. Window start/end: Actually this is already provided in other ways and
>> the window in the SQL environment is unused and just waiting to be deleted.
>> So you can still access TUMBLE_START, etc. This is well-defined as a part
>> of the row so there's no semantic problem, but I think it should already
>> work.
>> *MM: Others work except SESSION_END();*
>>
>> 2. Pane information: I don't think access to pane info is enough for
>> correct results for a SQL join that triggers more than once. The pane info
>> is part of a Beam element, but these records just represent a kind of
>> changelog of the aggregation/join. The general solution is retractions.
>> Until we finish that, you need to follow the Join/CoGBK with custom logic ,
>> often a stateful DoFn to get the join results right. For example, if both
>> inputs are append-only relations and it is an equijoin, then you can do
>> this with a dedupe when you unpack the CoGbkResult. I am guessing this is
>> the main use case for BEAM-5204. Is it your use case?
>> *MM: my case is a self-join with SQL-only, written as [DISCARD_Pane JOIN
>> ACCU_Pane];*
>> *These UDFs is not a blocker, limitation in BEAM-5204 should be removed
>> directly IMO. With multiple-trigger assigned, developers need to handle the
>> output which is not complex with Java SDK, but very hard for SQL only
>> cases. *
>>
>>
>> On Thu, Nov 15, 2018 at 10:54 AM Kenneth Knowles  wrote:
>>
>>> From https://issues.apache.org/jira/browse/BEAM-5204 it seems like what
>>> you most care about is to have joins that trigger more than once per
>>> window. To accomplish it you hope to build an "escape hatch" from
>>> SQL/relational semantics to specialized Beam SQL semantics. It could make
>>> sense with extreme care.
>>>
>>> Separating the two parts:
>>>
>>> 1. Window start/end: Actually this is already provided in other ways and
>>> the window in the SQL environment is unused and just waiting to be deleted.
>>> So you can still access TUMBLE_START, etc. This is well-defined as a part
>>> of the row so there's no semantic problem, but I think it should already
>>> work.
>>>
>>> 2. Pane information: I don't think access to pane info is enough for
>>> correct results for a SQL join that triggers more than once. The pane info
>>> is part of a Beam element, but these records just represent a kind of
>>> changelog of the aggregation/join. The general solution is retractions.
>>> Until we finish that, you need to follow the Join/CoGBK with custom logic ,
>>> often a stateful DoFn to get the join results right. For example, if both
>>> inputs are append-only relations and it is an equijoin, then you can do
>>> this with a dedupe when you unpack the CoGbkResult. I am guessing this is
>>> the main use case for BEAM-5204. Is it your use case?
>>>
>>> Kenn
>>>
>>> On Thu, Nov 15, 2018 at 10:08 AM Mingmin Xu  wrote:
>>>
 Raise this thread.
 Seems there're more changes in the backend on how a FUNCTION is
 executed in the backend, as noticed in #6996
 :
 1. BeamSqlExpression and BeamSqlExpressionExecutor are 

Re: Need help regarding memory leak issue

2018-11-15 Thread Ruoyun Huang
I'm trying to understand the situation you are having.

By saying it 'kills the application', do you mean a leak in the application
itself, or are the workers the root cause? Also, are you running ML
models inside Python SDK DoFns? Then I suppose it is running some
predictions rather than model training?

On Thu, Nov 15, 2018 at 1:08 PM Rakesh Kumar  wrote:

> I am using *Beam Python SDK *to run my app in production. The app is
> running machine learning models. I am noticing some memory leak which
> eventually kills the application. I am not sure the source of memory leak.
> Currently, I am using object graph
>  to dump the memory
> stats. I hope I will get some useful information out of this. I have also
> looked into Guppy library  and they are
> almost the same.
>
> Do you guys have any recommendation for debugging this issue? Do we have
> any tooling in the SDK that can help to debug it?
> Please feel free to share your experience if you have debugged similar
> issues in past.
>
> Thank you,
> Rakesh
>


-- 

Ruoyun  Huang


Apache Beam Newsletter - November 2018

2018-11-15 Thread Rose Nguyen
[image: Beam.png]

November 2018 | Newsletter

What’s been done

Beam Community Metrics (by: Mikhail Gryzykhin, Udi Meiri, Huygaa Batsaikhan)

   - To help track project health status, a dashboarding platform was added.
   - Initial dashboards were created that aim at tracking pre- and post-commit
   test status and engineering load.
   - Leave feedback under BEAM-5862
   - View the dashboards here

Apache Beam 2.8.0 released! (by: many contributors)

   - Major new features and improvements, such as Python on Flink MVP
   - You can download the release here.
   - See the blog post for more details.


New Edit button on beam.apache.org pages (by: Alan Myrvold, Scott Wegner)

   - To make it easier for non-committers to update documentation, an edit
   button has been added on https://beam.apache.org pages to help create a
   pull request using the GitHub web UI.
   - See BEAM-4431 for more details.

RabbitMqIO (by: Jean-Baptiste Onofré)

   - An IO to publish or consume messages with a RabbitMQ broker


Graphite sink for metrics (by: Etienne Chauchot)

   - Metrics Pusher can now export Beam metrics to Graphite


BeamSQL (by: Rui Wang, Mingmin Xu)

   - Added 13 built-in SQL functions.
   - Enabled function overloading for UDFs via a new UDF registration approach.
   - UDFs now support Joda DateTime as an argument type.

What we're working on...

Flink Portable Runner (by: Ankur Goenka, Maximilian Michels, Thomas Weise,
Ryan Williams, Robert Bradshaw)

   - Integration of timers in user functions for streaming and batch execution
   - Enabling TFX pipelines to run on Flink
   - Investigating the integration of metrics
   - Bug fixes


Load tests of Core Apache Beam Operations (by: Łukasz Gajowy, Katarzyna
Kucharczyk)

   - Test operations such as GroupByKey, ParDo, Combine, etc. in stressful
   conditions.
   - See https://s.apache.org/load-test-basic-operations for more details on
   how it works.

New Members
New Committers

   - David Morávek, Pilsen, Czech Republic
     - Using Beam for an internet scale web crawler
     - See BEAM-3900 for more details.

Talks & Meetups

Hadoop User Group @ Warsaw, Poland

   - Apache Beam - what do I gain? by Łukasz Gajowy (link to the meetup)
   - We discussed the basics of the Dataflow model, Beam in more detail, and
   familiarized the audience with the current state of the project


Resources

Blog Post on London Summit (by: Matthias Baetens)

   - “Inaugural edition of the Beam Summit Europe 2018 - aftermath” - a recap
   of the conference, including the presentation slide decks.
   - See the post here and videos of the sessions on the Apache Beam YouTube
   channel.

How to transfer BigQuery tables between locations (by: Graham Polley)

   - A Cloud Dataflow solution in Java for transferring BigQuery tables,
   including source code.
   - See the Medium article here.


Hands on Apache Beam, building data pipelines in Python (by: Graham Polley)

   - Writing a Beam pipeline in Python to compute the mean of the Open and
   Close columns for a historical S&P 500 dataset.
   - See the Medium Towards Data Science article here and the GitHub tutorial
   here.


*Until Next Time!*

*This edition was curated by our community of contributors, committers and
PMCs. It contains work done in November 2018 and ongoing efforts. We hope
to provide visibility to what's going on in the community, so if you have
questions, feel free to ask in this thread.*
-- 
Rose Thị Nguyễn


Re: Please do not merge Python PRs

2018-11-15 Thread Udi Meiri
All clear, Python tests are reporting errors correctly again.

On Wed, Nov 14, 2018 at 5:57 PM Udi Meiri  wrote:

> https://github.com/apache/beam/pull/7048 is the rollback PR
>
> On Wed, Nov 14, 2018 at 5:28 PM Ahmet Altay  wrote:
>
>> Thank you Udi. Could you send a rollback PR?
>>
>> I believe this is https://issues.apache.org/jira/browse/BEAM-6048
>>
>> On Wed, Nov 14, 2018 at 5:16 PM, Udi Meiri  wrote:
>>
>>> It seems that Gradle is not getting the correct exit status from test
>>> runs.
>>> Possible culprit: https://github.com/apache/beam/pull/6903
>>>
>>
>>




[PROPOSAL] Prepare Beam 2.9.0 release

2018-11-15 Thread Chamikara Jayalath
Hi All,

According to the release calendar [1], the branch cut date for the Beam 2.9.0
release is 11/21/2018. Since the previous release branch was cut close to its
calendar date, I'd like to propose cutting the release branch for 2.9.0 on
11/21/2018. Next week is the Thanksgiving holiday in the US and some folks
will possibly be out, so we can try to produce RC1 on the Monday after
(11/26/2018). We can attend to the current blocker JIRAs [2] in the meantime.

I'd like to volunteer to perform this release.

WDYT?

Thanks,
Cham

[1]
https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com&ctz=America%2FLos_Angeles
[2] https://s.apache.org/beam-2.9.0-burndown


Re: Design review for supporting AutoValue Coders and conversions to Row

2018-11-15 Thread Reuven Lax
I would hope so if possible.

On Fri, Nov 16, 2018, 4:36 AM Kenneth Knowles  wrote:

> Just some low-level detail: If there is no @DefaultSchema annotation but
> it is an @AutoValue class, can schema inference go ahead with the
> AutoValueSchema? Then the user doesn't have to do anything.
>
> Kenn
>
> On Wed, Nov 14, 2018 at 6:14 AM Reuven Lax  wrote:
>
>> We already have a framework for ByteBuddy codegen for JavaBean Row
>> interfaces, which should hopefully be easy to extend to AutoValue (and more
>> efficient than using reflection). I'm working on adding constructor support
>> to this right now.
>>
>> On Wed, Nov 14, 2018 at 12:29 AM Jeff Klukas  wrote:
>>
>>> Sounds, then, like we need to define a new `AutoValueSchema extends
>>> SchemaProvider` and users would opt-in to this via the DefaultSchema
>>> annotation:
>>>
>>> @DefaultSchema(AutoValueSchema.class)
>>> @AutoValue
>>> public abstract MyClass ...
>>>
>>> Since we already have the JavaBean and JavaField reflection-based schema
>>> providers to use as a guide, it sounds like it may be best to try to
>>> implement this using reflection rather than implementing an AutoValue
>>> extension.
>>>
>>> A reflection-based approach here would hinge on being able to discover
>>> the package-private constructor for the concrete class and read its types.
>>> Those types would define the schema, and the fromRow implementation would
>>> call the discovered constructor.
>>>
>>> On Mon, Nov 12, 2018 at 10:02 AM Reuven Lax  wrote:
>>>


 On Mon, Nov 12, 2018 at 11:38 PM Jeff Klukas 
 wrote:

> Reuven - A SchemaProvider makes sense. It's not clear to me, though,
> whether that's more limited than a Coder. Do all values of the schema have
> to be simple types, or does Beam SQL support nested schemas?
>

 Nested schemas, collection types (lists and maps), and collections of
 nested types are all supported.

>
> Put another way, would a user be able to create an AutoValue class
> comprised of simple types and then use that as a field inside another
> AutoValue class? I can see how that's possible with Coders, but not clear
> whether that's possible with Row schemas.
>

 Yes, this is explicitly supported.

>
> On Fri, Nov 9, 2018 at 8:22 PM Reuven Lax  wrote:
>
>> Hi Jeff,
>>
>> I would suggest a slightly different approach. Instead of generating
>> a coder, writing a SchemaProvider that generates a schema for AutoValue.
>> Once a PCollection has a schema, a coder is not needed (as Beam knows how
>> to encode any type with a schema), and it will work seamlessly with Beam
>> SQL (in fact you don't need to write a transform to turn it into a Row 
>> if a
>> schema is registered).
>>
>> We already do this for POJOs and basic JavaBeans. I'm happy to help
>> do this for AutoValue.
>>
>> Reuven
>>
>> On Sat, Nov 10, 2018 at 5:50 AM Jeff Klukas 
>> wrote:
>>
>>> Hi all - I'm looking for some review and commentary on a proposed
>>> design for providing built-in Coders for AutoValue classes. There's
>>> existing discussion in BEAM-1891 [0] about using AvroCoder, but that's
>>> blocked on incompatibility between AutoValue and Avro's reflection
>>> machinery that don't look resolvable.
>>>
>>> I wrote up a design document [1] that instead proposes using
>>> AutoValue's extension API to automatically generate a Coder for each
>>> AutoValue class that users generate. A similar technique could be used 
>>> to
>>> generate conversions to and from Row for use with BeamSql.
>>>
>>> I'd appreciate review of the design and thoughts on whether this
>>> seems feasible to support within the Beam codebase. I may be missing a
>>> simpler approach.
>>>
>>> [0] https://issues.apache.org/jira/browse/BEAM-1891
>>> [1]
>>> https://docs.google.com/document/d/1ucoik4WzUDfilqIz3I1AuMHc1J8DE6iv7gaUCDI42BI/edit?usp=sharing
>>>
>>


Re: Migrating Beam SQL to Calcite's code generation

2018-11-15 Thread Rui Wang
For the not working SESSION_END(), I had an investigation on it:

https://issues.apache.org/jira/browse/BEAM-5799
https://issues.apache.org/jira/browse/CALCITE-2645

According to the reply in the Calcite JIRA, there might be some other way
to implement SESSION_END. I haven't looked into it though.

-Rui

On Thu, Nov 15, 2018 at 11:56 AM Mingmin Xu  wrote:

> 1. Window start/end: Actually this is already provided in other ways and
> the window in the SQL environment is unused and just waiting to be deleted.
> So you can still access TUMBLE_START, etc. This is well-defined as a part
> of the row so there's no semantic problem, but I think it should already
> work.
> *MM: Others work except SESSION_END();*
>
> 2. Pane information: I don't think access to pane info is enough for
> correct results for a SQL join that triggers more than once. The pane info
> is part of a Beam element, but these records just represent a kind of
> changelog of the aggregation/join. The general solution is retractions.
> Until we finish that, you need to follow the Join/CoGBK with custom logic ,
> often a stateful DoFn to get the join results right. For example, if both
> inputs are append-only relations and it is an equijoin, then you can do
> this with a dedupe when you unpack the CoGbkResult. I am guessing this is
> the main use case for BEAM-5204. Is it your use case?
> *MM: my case is a self-join with SQL-only, written as [DISCARD_Pane JOIN
> ACCU_Pane];*
> *These UDFs is not a blocker, limitation in BEAM-5204 should be removed
> directly IMO. With multiple-trigger assigned, developers need to handle the
> output which is not complex with Java SDK, but very hard for SQL only
> cases. *
>
>
> On Thu, Nov 15, 2018 at 10:54 AM Kenneth Knowles  wrote:
>
>> From https://issues.apache.org/jira/browse/BEAM-5204 it seems like what
>> you most care about is to have joins that trigger more than once per
>> window. To accomplish it you hope to build an "escape hatch" from
>> SQL/relational semantics to specialized Beam SQL semantics. It could make
>> sense with extreme care.
>>
>> Separating the two parts:
>>
>> 1. Window start/end: Actually this is already provided in other ways and
>> the window in the SQL environment is unused and just waiting to be deleted.
>> So you can still access TUMBLE_START, etc. This is well-defined as a part
>> of the row so there's no semantic problem, but I think it should already
>> work.
>>
>> 2. Pane information: I don't think access to pane info is enough for
>> correct results for a SQL join that triggers more than once. The pane info
>> is part of a Beam element, but these records just represent a kind of
>> changelog of the aggregation/join. The general solution is retractions.
>> Until we finish that, you need to follow the Join/CoGBK with custom logic ,
>> often a stateful DoFn to get the join results right. For example, if both
>> inputs are append-only relations and it is an equijoin, then you can do
>> this with a dedupe when you unpack the CoGbkResult. I am guessing this is
>> the main use case for BEAM-5204. Is it your use case?
>>
>> Kenn
>>
>> On Thu, Nov 15, 2018 at 10:08 AM Mingmin Xu  wrote:
>>
>>> Raise this thread.
>>> Seems there're more changes in the backend on how a FUNCTION is executed
>>> in the backend, as noticed in #6996
>>> :
>>> 1. BeamSqlExpression and BeamSqlExpressionExecutor are removed;
>>> 2. BeamSqlExpressionEnvironment are removed;
>>>
>>> Then,
>>> 1. for Calcite defined FUNCTIONS, it uses Calcite generated code (which
>>> is great and duplicate work is worthless);
>>> *2. no way to access Beam context now;*
>>>
>>> For *#2*, I think we need to find a way to expose it, at least our
>>> UDF/UDAF should be able to access it to leverage the advantages of Beam
>>> module.
>>>
>>> Any comments?
>>>
>>>
>>> On Wed, Sep 19, 2018 at 2:55 PM Rui Wang  wrote:
>>>
 This is a so exciting change!

 Since we cannot mix current implementation with Calcite code
 generation, is there any case that Calcite code generation does not support
 but our current implementation supports, so switching to Calcite code
 generation will have some impact to existing usage?

 -Rui

 On Wed, Sep 19, 2018 at 11:53 AM Andrew Pilloud 
 wrote:

> To follow up on this, the PR is now in a reviewable state and I've
> added more tests for FLOOR and CEIL. Both work with a more extensive set 
> of
> arguments after this change. There are now 4 outstanding calcite PRs that
> get all the tests passing.
>
> Unfortunately there is no easy way to mix our current implementation
> and using Calcite's code generator.
>
> Andrew
>
> On Mon, Sep 17, 2018 at 3:22 PM Mingmin Xu  wrote:
>
>> Awesome work, we should call Calcite operator functions if available.
>>
>> I haven't get time to read the PR yet, for those impacted would keep
>> existing implementation. One 

Re: Migrating Beam SQL to Calcite's code generation

2018-11-15 Thread Mingmin Xu
1. Window start/end: Actually this is already provided in other ways and
the window in the SQL environment is unused and just waiting to be deleted.
So you can still access TUMBLE_START, etc. This is well-defined as a part
of the row so there's no semantic problem, but I think it should already
work.
*MM: Others work except SESSION_END();*

2. Pane information: I don't think access to pane info is enough for
correct results for a SQL join that triggers more than once. The pane info
is part of a Beam element, but these records just represent a kind of
changelog of the aggregation/join. The general solution is retractions.
Until we finish that, you need to follow the Join/CoGBK with custom logic ,
often a stateful DoFn to get the join results right. For example, if both
inputs are append-only relations and it is an equijoin, then you can do
this with a dedupe when you unpack the CoGbkResult. I am guessing this is
the main use case for BEAM-5204. Is it your use case?
*MM: my case is a self-join with SQL-only, written as [DISCARD_Pane JOIN
ACCU_Pane];*
*These UDFs are not a blocker; the limitation in BEAM-5204 should be removed
directly IMO. With multiple triggers assigned, developers need to handle the
output, which is not complex with the Java SDK but very hard for SQL-only
cases.*


On Thu, Nov 15, 2018 at 10:54 AM Kenneth Knowles  wrote:

> From https://issues.apache.org/jira/browse/BEAM-5204 it seems like what
> you most care about is to have joins that trigger more than once per
> window. To accomplish it you hope to build an "escape hatch" from
> SQL/relational semantics to specialized Beam SQL semantics. It could make
> sense with extreme care.
>
> Separating the two parts:
>
> 1. Window start/end: Actually this is already provided in other ways and
> the window in the SQL environment is unused and just waiting to be deleted.
> So you can still access TUMBLE_START, etc. This is well-defined as a part
> of the row so there's no semantic problem, but I think it should already
> work.
>
> 2. Pane information: I don't think access to pane info is enough for
> correct results for a SQL join that triggers more than once. The pane info
> is part of a Beam element, but these records just represent a kind of
> changelog of the aggregation/join. The general solution is retractions.
> Until we finish that, you need to follow the Join/CoGBK with custom logic ,
> often a stateful DoFn to get the join results right. For example, if both
> inputs are append-only relations and it is an equijoin, then you can do
> this with a dedupe when you unpack the CoGbkResult. I am guessing this is
> the main use case for BEAM-5204. Is it your use case?
>
> Kenn
>
> On Thu, Nov 15, 2018 at 10:08 AM Mingmin Xu  wrote:
>
>> Raise this thread.
>> Seems there're more changes in the backend on how a FUNCTION is executed
>> in the backend, as noticed in #6996
>> :
>> 1. BeamSqlExpression and BeamSqlExpressionExecutor are removed;
>> 2. BeamSqlExpressionEnvironment are removed;
>>
>> Then,
>> 1. for Calcite defined FUNCTIONS, it uses Calcite generated code (which
>> is great and duplicate work is worthless);
>> *2. no way to access Beam context now;*
>>
>> For *#2*, I think we need to find a way to expose it, at least our
>> UDF/UDAF should be able to access it to leverage the advantages of Beam
>> module.
>>
>> Any comments?
>>
>>
>> On Wed, Sep 19, 2018 at 2:55 PM Rui Wang  wrote:
>>
>>> This is a so exciting change!
>>>
>>> Since we cannot mix current implementation with Calcite code generation,
>>> is there any case that Calcite code generation does not support but our
>>> current implementation supports, so switching to Calcite code generation
>>> will have some impact to existing usage?
>>>
>>> -Rui
>>>
>>> On Wed, Sep 19, 2018 at 11:53 AM Andrew Pilloud 
>>> wrote:
>>>
 To follow up on this, the PR is now in a reviewable state and I've
 added more tests for FLOOR and CEIL. Both work with a more extensive set of
 arguments after this change. There are now 4 outstanding calcite PRs that
 get all the tests passing.

 Unfortunately there is no easy way to mix our current implementation
 and using Calcite's code generator.

 Andrew

 On Mon, Sep 17, 2018 at 3:22 PM Mingmin Xu  wrote:

> Awesome work, we should call Calcite operator functions if available.
>
> I haven't had time to read the PR yet; for those impacted, would we keep the
> existing implementation? One example is, I noticed recently that FLOOR/CEIL
> only supports months/years, which is quite a surprise to me.
>
> Mingmin
>
> On Mon, Sep 17, 2018 at 3:03 PM Anton Kedin  wrote:
>
>> This is pretty amazing! Thank you for doing this!
>>
>> Regards,
>> Anton
>>
>> On Mon, Sep 17, 2018 at 2:27 PM Andrew Pilloud 
>> wrote:
>>
>>> I've adapted Calcite's EnumerableCalc code generation to generate
>>> the BeamCalc DoFn. The primary 

Re: Design review for supporting AutoValue Coders and conversions to Row

2018-11-15 Thread Anton Kedin
One reason is that @AutoValue is not guaranteed to be retained at runtime:
https://github.com/google/auto/blob/master/value/src/main/java/com/google/auto/value/AutoValue.java#L44
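
A quick way to see the consequence (sketch; class and field names are
illustrative): because the annotation is not retained for runtime reflection,
schema inference cannot simply ask whether a class is annotated with
@AutoValue and would have to key off something else, such as an explicit
@DefaultSchema annotation.

import com.google.auto.value.AutoValue;

public class RetentionCheck {

  @AutoValue
  abstract static class Example {
    abstract String name();
  }

  public static void main(String[] args) {
    // Prints "false": @AutoValue is not visible to runtime reflection, which
    // is the limitation pointed out above.
    System.out.println(Example.class.isAnnotationPresent(AutoValue.class));
  }
}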


On Thu, Nov 15, 2018 at 11:36 AM Kenneth Knowles  wrote:

> Just some low-level detail: If there is no @DefaultSchema annotation but
> it is an @AutoValue class, can schema inference go ahead with the
> AutoValueSchema? Then the user doesn't have to do anything.
>
> Kenn
>
> On Wed, Nov 14, 2018 at 6:14 AM Reuven Lax  wrote:
>
>> We already have a framework for ByteBuddy codegen for JavaBean Row
>> interfaces, which should hopefully be easy to extend to AutoValue (and more
>> efficient than using reflection). I'm working on adding constructor support
>> to this right now.
>>
>> On Wed, Nov 14, 2018 at 12:29 AM Jeff Klukas  wrote:
>>
>>> Sounds, then, like we need to define a new `AutoValueSchema extends
>>> SchemaProvider` and users would opt-in to this via the DefaultSchema
>>> annotation:
>>>
>>> @DefaultSchema(AutoValueSchema.class)
>>> @AutoValue
>>> public abstract MyClass ...
>>>
>>> Since we already have the JavaBean and JavaField reflection-based schema
>>> providers to use as a guide, it sounds like it may be best to try to
>>> implement this using reflection rather than implementing an AutoValue
>>> extension.
>>>
>>> A reflection-based approach here would hinge on being able to discover
>>> the package-private constructor for the concrete class and read its types.
>>> Those types would define the schema, and the fromRow implementation would
>>> call the discovered constructor.
>>>
>>> On Mon, Nov 12, 2018 at 10:02 AM Reuven Lax  wrote:
>>>


 On Mon, Nov 12, 2018 at 11:38 PM Jeff Klukas 
 wrote:

> Reuven - A SchemaProvider makes sense. It's not clear to me, though,
> whether that's more limited than a Coder. Do all values of the schema have
> to be simple types, or does Beam SQL support nested schemas?
>

 Nested schemas, collection types (lists and maps), and collections of
 nested types are all supported.

>
> Put another way, would a user be able to create an AutoValue class
> comprised of simple types and then use that as a field inside another
> AutoValue class? I can see how that's possible with Coders, but not clear
> whether that's possible with Row schemas.
>

 Yes, this is explicitly supported.

>
> On Fri, Nov 9, 2018 at 8:22 PM Reuven Lax  wrote:
>
>> Hi Jeff,
>>
>> I would suggest a slightly different approach. Instead of generating
>> a coder, writing a SchemaProvider that generates a schema for AutoValue.
>> Once a PCollection has a schema, a coder is not needed (as Beam knows how
>> to encode any type with a schema), and it will work seamlessly with Beam
>> SQL (in fact you don't need to write a transform to turn it into a Row 
>> if a
>> schema is registered).
>>
>> We already do this for POJOs and basic JavaBeans. I'm happy to help
>> do this for AutoValue.
>>
>> Reuven
>>
>> On Sat, Nov 10, 2018 at 5:50 AM Jeff Klukas 
>> wrote:
>>
>>> Hi all - I'm looking for some review and commentary on a proposed
>>> design for providing built-in Coders for AutoValue classes. There's
>>> existing discussion in BEAM-1891 [0] about using AvroCoder, but that's
>>> blocked on incompatibility between AutoValue and Avro's reflection
>>> machinery that don't look resolvable.
>>>
>>> I wrote up a design document [1] that instead proposes using
>>> AutoValue's extension API to automatically generate a Coder for each
>>> AutoValue class that users generate. A similar technique could be used 
>>> to
>>> generate conversions to and from Row for use with BeamSql.
>>>
>>> I'd appreciate review of the design and thoughts on whether this
>>> seems feasible to support within the Beam codebase. I may be missing a
>>> simpler approach.
>>>
>>> [0] https://issues.apache.org/jira/browse/BEAM-1891
>>> [1]
>>> https://docs.google.com/document/d/1ucoik4WzUDfilqIz3I1AuMHc1J8DE6iv7gaUCDI42BI/edit?usp=sharing
>>>
>>


Re: [DISCUSS] Publish vendored dependencies independently

2018-11-15 Thread Thomas Weise
In case any of this affects how artifacts are published in general, please
make sure that publishing to a 3rd party repo continues to work.

For example: ./gradlew :beam-runners-flink_2.11-job-server:publish
-PisRelease -PnoSigning -PdistMgmtSnapshotsUrl=
https://mycustomrepo/libs-release

Thanks,
Thomas


On Thu, Nov 15, 2018 at 11:27 AM Kenneth Knowles  wrote:

> Agree on the low bar. We should just make them all 0.x releases to send
> the right message (don't use, and no compatibility) and not worry as much
> about bad releases, which we
> would never actually depend on in the project.
>
> QQ: What does the new -P flag do? I was also hoping to eliminate the
> redundant -PisRelease flag, especially for vendored deps that should really
> be straight line.
>
> Kenn
>
> On Wed, Nov 14, 2018 at 12:38 PM Lukasz Cwik  wrote:
>
>> It's a small hassle but could be checked in with some changes, my example
>> commit was so that people could try it out as is.
>>
>> I'll work towards getting it checked in and then start a release for gRPC
>> and guava.
>>
>> On Wed, Nov 14, 2018 at 11:45 AM Scott Wegner  wrote:
>>
>>> Thanks for pushing this forward Luke.
>>>
>>> My understanding is that these vendored grpc artifacts will only be
>>> consumed directly by Beam internal components (as opposed to Beam user
>>> projects). So there should be a fairly low bar for publishing them. But
>>> perhaps we should have some short checklist for releasing them for
>>> consistency.
>>>
>>> One item I would suggest for such a checklist would be to publish
>>> artifacts from checked-in apache/beam sources and then tag the release
>>> commit. Is it possible to get your changes merged in first, or is there a
>>> chicken-and-egg problem that artifacts need to be published and available
>>> for consumption?
>>>
>>> On Wed, Nov 14, 2018 at 10:51 AM Lukasz Cwik  wrote:
>>>
 Note, I could also release the vendored version of guava 20 in
 preparation for us to start consuming it. Any concerns?

 On Tue, Nov 13, 2018 at 3:59 PM Lukasz Cwik  wrote:

> I have made some incremental progress on this and wanted to release
> our first vendored dependency of gRPC 1.13.1 since I was able to fix a 
> good
> number of the import/code completion errors that Intellij was 
> experiencing.
> I have published an example of what the jar/pom looks like in the Apache
> Staging repo:
>
> https://repository.apache.org/content/groups/snapshots/org/apache/beam/beam-vendor-grpc-1_13_1/
>
> You can also checkout[1] and from a clean workspace run:
> ./gradlew :beam-vendor-grpc-1_13_1:publishToMavenLocal -PisRelease
> -PvendoredDependenciesOnly
> which will build a vendored version of gRPC that is published to your
> local maven repository. All the projects that depended on the gradle
> beam-vendor-grpc-1_13_1 project are now pointing at the Maven artifact
> org.apache.beam:beam-vendor-grpc-1_13_1:0.1
>
> I was planning to follow the Apache Beam release process but only for
> this specific artifact and start a vote thread if there aren't any 
> concerns.
>
> 1:
> https://github.com/lukecwik/incubator-beam/commit/4b1b7b40ef316559f81c42dfdd44da988db201e9
>
>
> On Thu, Oct 25, 2018 at 10:59 AM Lukasz Cwik  wrote:
>
>> That's a good point, Thomas; I hadn't considered the lib/ case. I also am
>> recommending what Thomas is suggesting as well.
>>
>> On Thu, Oct 25, 2018 at 10:52 AM Maximilian Michels 
>> wrote:
>>
>>> On 25.10.18 19:23, Lukasz Cwik wrote:
>>> >
>>> >
>>> > On Thu, Oct 25, 2018 at 9:59 AM Maximilian Michels >> > > wrote:
>>> >
>>> > Question: How would a user end up with the same shaded
>>> dependency
>>> > twice?
>>> > The shaded dependencies are transitive dependencies of Beam
>>> and thus,
>>> > this shouldn't happen. Is this a safe-guard when running
>>> different
>>> > versions of Beam in the same JVM?
>>> >
>>> >
>>> > What I was referring to was that they aren't exactly the same
>>> dependency
>>> > but slightly different versions of the same dependency. Since we
>>> are
>>> > planning to vendor each dependency and its transitive dependencies
>>> as
>>> > part of the same jar, we can have  vendor-A that contains shaded
>>> > transitive-C 1.0 and vendor-B that contains transitive-C 2.0 both
>>> with
>>> > different package prefixes. It can be that transitive-C 1.0 and
>>> > transitive-C 2.0 can't be on the same classpath because they can't
>>> be
>>> > perfectly shaded due to JNI, java reflection, magical property
>>> > files/strings, ...
>>> >
>>>
>>> Ah yes. Get it. Thanks!
>>>
>>
>>>
>>> --
>>>
>>>
>>>
>>>
>>> Got feedback? tinyurl.com/swegner-feedback
>>>
>>


Re: Design review for supporting AutoValue Coders and conversions to Row

2018-11-15 Thread Kenneth Knowles
Just some low-level detail: If there is no @DefaultSchema annotation but it
is an @AutoValue class, can schema inference go ahead with the
AutoValueSchema? Then the user doesn't have to do anything.
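
For reference, the explicit opt-in being discussed in this thread looks
roughly like the sketch below. AutoValueSchema is the proposed provider and
does not exist in Beam yet, so it is a placeholder; the class and its fields
are illustrative. The question above is whether the @DefaultSchema line could
be inferred for @AutoValue classes rather than required.

import com.google.auto.value.AutoValue;
import org.apache.beam.sdk.schemas.annotations.DefaultSchema;

// AutoValueSchema is the SchemaProvider proposed in this thread (assumed
// here); DefaultSchema and AutoValue are existing APIs.
@DefaultSchema(AutoValueSchema.class)
@AutoValue
public abstract class Transaction {
  public abstract String bank();

  public abstract double purchaseAmount();

  public static Transaction create(String bank, double purchaseAmount) {
    // AutoValue_Transaction is generated by the AutoValue processor.
    return new AutoValue_Transaction(bank, purchaseAmount);
  }
}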

Kenn

On Wed, Nov 14, 2018 at 6:14 AM Reuven Lax  wrote:

> We already have a framework for ByteBuddy codegen for JavaBean Row
> interfaces, which should hopefully be easy to extend to AutoValue (and more
> efficient than using reflection). I'm working on adding constructor support
> to this right now.
>
> On Wed, Nov 14, 2018 at 12:29 AM Jeff Klukas  wrote:
>
>> Sounds, then, like we need to define a new `AutoValueSchema extends
>> SchemaProvider` and users would opt-in to this via the DefaultSchema
>> annotation:
>>
>> @DefaultSchema(AutoValueSchema.class)
>> @AutoValue
>> public abstract MyClass ...
>>
>> Since we already have the JavaBean and JavaField reflection-based schema
>> providers to use as a guide, it sounds like it may be best to try to
>> implement this using reflection rather than implementing an AutoValue
>> extension.
>>
>> A reflection-based approach here would hinge on being able to discover
>> the package-private constructor for the concrete class and read its types.
>> Those types would define the schema, and the fromRow implementation would
>> call the discovered constructor.
>>
>> On Mon, Nov 12, 2018 at 10:02 AM Reuven Lax  wrote:
>>
>>>
>>>
>>> On Mon, Nov 12, 2018 at 11:38 PM Jeff Klukas 
>>> wrote:
>>>
 Reuven - A SchemaProvider makes sense. It's not clear to me, though,
 whether that's more limited than a Coder. Do all values of the schema have
 to be simple types, or does Beam SQL support nested schemas?

>>>
>>> Nested schemas, collection types (lists and maps), and collections of
>>> nested types are all supported.
>>>

 Put another way, would a user be able to create an AutoValue class
 comprised of simple types and then use that as a field inside another
 AutoValue class? I can see how that's possible with Coders, but not clear
 whether that's possible with Row schemas.

>>>
>>> Yes, this is explicitly supported.
>>>

 On Fri, Nov 9, 2018 at 8:22 PM Reuven Lax  wrote:

> Hi Jeff,
>
> I would suggest a slightly different approach. Instead of generating a
> coder, writing a SchemaProvider that generates a schema for AutoValue. 
> Once
> a PCollection has a schema, a coder is not needed (as Beam knows how to
> encode any type with a schema), and it will work seamlessly with Beam SQL
> (in fact you don't need to write a transform to turn it into a Row if a
> schema is registered).
>
> We already do this for POJOs and basic JavaBeans. I'm happy to help do
> this for AutoValue.
>
> Reuven
>
> On Sat, Nov 10, 2018 at 5:50 AM Jeff Klukas 
> wrote:
>
>> Hi all - I'm looking for some review and commentary on a proposed
>> design for providing built-in Coders for AutoValue classes. There's
>> existing discussion in BEAM-1891 [0] about using AvroCoder, but that's
>> blocked on incompatibility between AutoValue and Avro's reflection
>> machinery that doesn't look resolvable.
>>
>> I wrote up a design document [1] that instead proposes using
>> AutoValue's extension API to automatically generate a Coder for each
>> AutoValue class that users generate. A similar technique could be used to
>> generate conversions to and from Row for use with BeamSql.
>>
>> I'd appreciate review of the design and thoughts on whether this
>> seems feasible to support within the Beam codebase. I may be missing a
>> simpler approach.
>>
>> [0] https://issues.apache.org/jira/browse/BEAM-1891
>> [1]
>> https://docs.google.com/document/d/1ucoik4WzUDfilqIz3I1AuMHc1J8DE6iv7gaUCDI42BI/edit?usp=sharing
>>
>


Re: Migrating Beam SQL to Calcite's code generation

2018-11-15 Thread Kenneth Knowles
From https://issues.apache.org/jira/browse/BEAM-5204 it seems like what you
most care about is to have joins that trigger more than once per window. To
accomplish it you hope to build an "escape hatch" from SQL/relational
semantics to specialized Beam SQL semantics. It could make sense with
extreme care.

Separating the two parts:

1. Window start/end: Actually this is already provided in other ways and
the window in the SQL environment is unused and just waiting to be deleted.
So you can still access TUMBLE_START, etc. This is well-defined as a part
of the row so there's no semantic problem, but I think it should already
work.

2. Pane information: I don't think access to pane info is enough for
correct results for a SQL join that triggers more than once. The pane info
is part of a Beam element, but these records just represent a kind of
changelog of the aggregation/join. The general solution is retractions.
Until we finish that, you need to follow the Join/CoGBK with custom logic,
often a stateful DoFn to get the join results right. For example, if both
inputs are append-only relations and it is an equijoin, then you can do
this with a dedupe when you unpack the CoGbkResult. I am guessing this is
the main use case for BEAM-5204. Is it your use case?
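
For the append-only equijoin case, a minimal sketch of that downstream dedupe
could look like this (illustrative only; it assumes the join output has already
been re-keyed by a stable, deterministic dedupe key, and it uses the Java state
API):

// Illustrative sketch: emit each joined result at most once across multiple
// trigger firings by tracking a "seen" bit per dedupe key with the state API.
// State is per key and window, so repeated firings of the same window are
// collapsed into a single emitted element.
import org.apache.beam.sdk.coders.BooleanCoder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

public class EmitOnce extends DoFn<KV<String, String>, KV<String, String>> {

  @StateId("seen")
  private final StateSpec<ValueState<Boolean>> seenSpec = StateSpecs.value(BooleanCoder.of());

  @ProcessElement
  public void processElement(ProcessContext c, @StateId("seen") ValueState<Boolean> seen) {
    if (seen.read() == null) {
      seen.write(true);
      c.output(c.element()); // first firing for this dedupe key: emit
    }
    // Later firings for the same key within the window are dropped.
  }
}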

Kenn

On Thu, Nov 15, 2018 at 10:08 AM Mingmin Xu  wrote:

> Raising this thread again.
> There seem to be more changes in the backend on how a FUNCTION is executed,
> as noticed in #6996:
> 1. BeamSqlExpression and BeamSqlExpressionExecutor are removed;
> 2. BeamSqlExpressionEnvironment is removed;
>
> Then,
> 1. for Calcite-defined FUNCTIONS, it uses Calcite-generated code (which is
> great, and duplicating the work would be pointless);
> *2. there is no way to access the Beam context now;*
>
> For *#2*, I think we need to find a way to expose it; at least our
> UDF/UDAF should be able to access it to leverage the advantages of the
> Beam module.
>
> Any comments?
>
>
> On Wed, Sep 19, 2018 at 2:55 PM Rui Wang  wrote:
>
>> This is such an exciting change!
>>
>> Since we cannot mix the current implementation with Calcite code generation,
>> is there any case that Calcite code generation does not support but our
>> current implementation does, such that switching to Calcite code generation
>> would have some impact on existing usage?
>>
>> -Rui
>>
>> On Wed, Sep 19, 2018 at 11:53 AM Andrew Pilloud 
>> wrote:
>>
>>> To follow up on this, the PR is now in a reviewable state and I've added
>>> more tests for FLOOR and CEIL. Both work with a more extensive set of
>>> arguments after this change. There are now 4 outstanding calcite PRs that
>>> get all the tests passing.
>>>
>>> Unfortunately there is no easy way to mix our current implementation and
>>> using Calcite's code generator.
>>>
>>> Andrew
>>>
>>> On Mon, Sep 17, 2018 at 3:22 PM Mingmin Xu  wrote:
>>>
 Awesome work, we should call Calcite operator functions if available.

 I haven't had time to read the PR yet; for those impacted we would keep the
 existing implementation. One example: I noticed recently that FLOOR/CEIL only
 supports months/years, which is quite a surprise to me.

 Mingmin

 On Mon, Sep 17, 2018 at 3:03 PM Anton Kedin  wrote:

> This is pretty amazing! Thank you for doing this!
>
> Regards,
> Anton
>
> On Mon, Sep 17, 2018 at 2:27 PM Andrew Pilloud 
> wrote:
>
>> I've adapted Calcite's EnumerableCalc code generation to generate the
>> BeamCalc DoFn. The primary purpose behind this change is so we can take
>> advantage of Calcite's extensive SQL operator implementation. This 
>> deletes
>> ~11000 lines of code from Beam (with ~350 added), significantly increases
>> the set of supported SQL operators, and improves performance and
>> correctness of currently supported operators. Here is my work in 
>> progress:
>> https://github.com/apache/beam/pull/6417
>>
>> There are a few bugs in Calcite that this has exposed:
>>
>> Fixed in Calcite master:
>>
>> - CALCITE-2321 - The type of a union of CHAR columns of different lengths
>>   should be VARCHAR
>> - CALCITE-2447 - Some POWER, ATAN2 functions fail with NoSuchMethodException
>>
>> Pending PRs:
>>
>> - CALCITE-2529 - linq4j should promote integer to floating point when
>>   generating function calls
>> - CALCITE-2530 - TRIM function does not throw exception when the length of
>>   trim character is not 1 (one)
>>
>> More work:
>>
>> - CALCITE-2404 - Accessing structured-types is not implemented by the runtime

Re: Migrating Beam SQL to Calcite's code generation

2018-11-15 Thread Mingmin Xu
Raising this thread again.
There seem to be more changes in the backend on how a FUNCTION is executed,
as noticed in #6996:

1. BeamSqlExpression and BeamSqlExpressionExecutor are removed;
2. BeamSqlExpressionEnvironment is removed;

Then,
1. for Calcite-defined FUNCTIONS, it uses Calcite-generated code (which is
great, and duplicating the work would be pointless);
*2. there is no way to access the Beam context now;*

For *#2*, I think we need to find a way to expose it; at least our UDF/UDAF
should be able to access it to leverage the advantages of the Beam module.

Any comments?


On Wed, Sep 19, 2018 at 2:55 PM Rui Wang  wrote:

> This is such an exciting change!
>
> Since we cannot mix the current implementation with Calcite code generation,
> is there any case that Calcite code generation does not support but our
> current implementation does, such that switching to Calcite code generation
> would have some impact on existing usage?
>
> -Rui
>
> On Wed, Sep 19, 2018 at 11:53 AM Andrew Pilloud 
> wrote:
>
>> To follow up on this, the PR is now in a reviewable state and I've added
>> more tests for FLOOR and CEIL. Both work with a more extensive set of
>> arguments after this change. There are now 4 outstanding calcite PRs that
>> get all the tests passing.
>>
>> Unfortunately there is no easy way to mix our current implementation and
>> using Calcite's code generator.
>>
>> Andrew
>>
>> On Mon, Sep 17, 2018 at 3:22 PM Mingmin Xu  wrote:
>>
>>> Awesome work, we should call Calcite operator functions if available.
>>>
>>> I haven't had time to read the PR yet; for those impacted we would keep the
>>> existing implementation. One example: I noticed recently that FLOOR/CEIL only
>>> supports months/years, which is quite a surprise to me.
>>>
>>> Mingmin
>>>
>>> On Mon, Sep 17, 2018 at 3:03 PM Anton Kedin  wrote:
>>>
 This is pretty amazing! Thank you for doing this!

 Regards,
 Anton

 On Mon, Sep 17, 2018 at 2:27 PM Andrew Pilloud 
 wrote:

> I've adapted Calcite's EnumerableCalc code generation to generate the
> BeamCalc DoFn. The primary purpose behind this change is so we can take
> advantage of Calcite's extensive SQL operator implementation. This deletes
> ~11000 lines of code from Beam (with ~350 added), significantly increases
> the set of supported SQL operators, and improves performance and
> correctness of currently supported operators. Here is my work in progress:
> https://github.com/apache/beam/pull/6417
>
> There are a few bugs in Calcite that this has exposed:
>
> Fixed in Calcite master:
>
> - CALCITE-2321 - The type of a union of CHAR columns of different lengths
>   should be VARCHAR
> - CALCITE-2447 - Some POWER, ATAN2 functions fail with NoSuchMethodException
>
> Pending PRs:
>
> - CALCITE-2529 - linq4j should promote integer to floating point when
>   generating function calls
> - CALCITE-2530 - TRIM function does not throw exception when the length of
>   trim character is not 1 (one)
>
> More work:
>
> - CALCITE-2404 - Accessing structured-types is not implemented by the runtime
> - (none yet) - Support multi character TRIM extension in Calcite
>
> I would like to push these changes in with these minor regressions. Do
> any of these Calcite bugs block this functionality being adding to Beam?
>
> Andrew
>

>>>
>>> --
>>> 
>>> Mingmin
>>>
>>

-- 

Mingmin


Re: [DISCUSS] SplittableDoFn Java SDK User Facing API

2018-11-15 Thread Ismaël Mejía
Some late comments, and my apologies in advance if some questions look silly,
but the last documents contained a lot of info that I have not yet fully
digested.

I have some questions about the ‘new’ Backlog concept following a
quick look at the PR
https://github.com/apache/beam/pull/6969/files

1. Is the Backlog a specific concept for each IO? Or in other words:
ByteKeyRestrictionTracker can be used by HBase and Bigtable, but I am
assuming from what I could understand that the Backlog implementation
will be data store specific, is this the case? or it can be in some
case generalized (for example for Filesystems)?

2. Since the backlog is a byte[] this means that it is up to the user
to give it a meaning depending on the situation, is this correct? Also
since splitRestriction has now the Backlog as an argument, what do we
expect the person that implements this method in a DoFn to do ideally
with it? Maybe a more concrete example of how things fit for
File/Offset or HBase/Bigtable/ByteKey will be helpful (maybe also for
the BundleFinalizer concept too).

3. By default all Restrictions are assumed to be unbounded but there
is this new Restrictions.IsBounded method, can’t this behavior be
inferred (adapted) from the DoFn UnboundedPerElement/Bounded
annotation or are these independent concepts?

Extra unrelated comment:
Since SDF is still @Experimental we should probably rename
OffsetRangeTracker and ByteKeyRangeTracker to use the RestrictionTracker
suffix (I don't know why the new trackers share the RangeTracker
suffix). WDYT?
On Wed, Nov 7, 2018 at 5:47 PM Lukasz Cwik  wrote:
>
>
>
> On Wed, Nov 7, 2018 at 8:33 AM Robert Bradshaw  wrote:
>>
>> I think that not returning the users specific subclass should be fine.
>> Does the removal of markDone imply that the consumer always knows a
>> "final" key to claim on any given restriction?
>
>
> Yes, each restriction needs to support claiming a "final" key that would make 
> the restriction "done". In the BigTable/HBase case it is the empty key "", 
> for files it can be a file offset beyond the end of the file. Generally, 
> restriction trackers written by SDF authors could also take an instance of an 
> object that they can compare instance equality against for a final key. 
> Alternatively we could allow restriction trackers to implement markDone(), but we 
> would need the SDK to have knowledge of the method by having the 
> RestrictionTracker implement an interface, extend an abstract base class, or be 
> reflectively found, so that we would be able to wrap it to provide 
> synchronization guarantees. I had toyed with the idea of using something like 
> the ProxyInvocationHandler that backs PipelineOptions to be able to provide a 
> modified version of the users instance that had the appropriate 
> synchronization guarantees but couldn't get it to work.
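
To make that contract concrete, here is a simplified, self-contained sketch
(not the actual Beam RestrictionTracker API) of an offset-based tracker where
claiming a position at or past the end of the restriction is what marks it
done, so no separate markDone() is needed:

// Simplified sketch of the "claim a final position to finish" contract for an
// offset-based restriction [from, to). Claiming any position >= to marks the
// restriction done and tells the caller to stop processing.
public class OffsetRangeSketch {
  private final long from;
  private final long to;
  private long lastClaimed = Long.MIN_VALUE;
  private boolean done = false;

  public OffsetRangeSketch(long from, long to) {
    this.from = from;
    this.to = to;
  }

  /** Returns true only if {@code offset} is still inside the restriction. */
  public synchronized boolean tryClaim(long offset) {
    if (offset < from || offset <= lastClaimed) {
      throw new IllegalArgumentException("Claims must be monotonic and within range");
    }
    lastClaimed = offset;
    if (offset >= to) {
      done = true;   // claiming a position at/past the end finishes the restriction
      return false;  // caller must stop processing
    }
    return true;
  }

  public synchronized boolean isDone() {
    return done;
  }

  public static void main(String[] args) {
    OffsetRangeSketch tracker = new OffsetRangeSketch(0, 3);
    for (long offset = 0; tracker.tryClaim(offset); offset++) {
      System.out.println("processing offset " + offset);
    }
    // The loop exits once tryClaim(3) returns false; the restriction is now done.
    System.out.println("done = " + tracker.isDone());
  }
}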
>
>>
>> On Wed, Nov 7, 2018 at 1:45 AM Lukasz Cwik  wrote:
>> >
>> > I have started to work on how to change the user facing API within the 
>> > Java SDK to support splitting/checkpointing[1], backlog reporting[2] and 
>> > bundle finalization[3].
>> >
>> > I have this PR[4] which contains minimal interface/type definitions to 
>> > convey how the API surface would change with these 4 changes:
>> > 1) Exposes the ability for @SplitRestriction to take a backlog suggestion 
>> > on how to perform splitting and for how many restrictions should be 
>> > returned.
>> > 2) Adds the ability for RestrictionTrackers to report backlog
>> > 3) Updates @ProcessElement to be required to take a generic 
>> > RestrictionTracker instead of the users own restriction tracker type.
>> > 4) Adds the ability for @StartBundle/@ProcessElement/@FinishBundle to 
>> > register a callback that is invoked after bundle finalization.
>> >
>> > The details are in the javadoc comments as to how I would expect the 
>> > contract to play out.
>> > Feel free to comment on the ML/PR around the contract and after the 
>> > feedback is received/digested/implemented, I would like to get the changes 
>> > submitted so that work can start  towards providing an implementation in 
>> > the Java SDK, Python SDK, and Go SDK and the shared runner portability 
>> > library.
>> >
>> > I would like to call out special attention to 3 since with this change it 
>> > will enable us to remove the synchronization requirement for users as we 
>> > will wrap the underlying restriction tracker allowing us to add 
>> > appropriate synchronization as needed and also to watch any calls that 
>> > pass through the object such as the claim calls. I also believe this 
>> > prevents people from writing RestrictionTrackers where the contract of 
>> > tryClaim is subverted since markDone is outside the purview of tryClaim as 
>> > in ByteKeyRangeTracker[5].
>> >
>> > 1: https://s.apache.org/beam-checkpoint-and-split-bundles
>> > 2: https://s.apache.org/beam-bundles-backlog-splitting
>> > 3: https://s.apache.org/beam-finalizing-bundles
>> > 4: https://github.com/apache/beam/pull/6969
>> > 5: 

Re: Python profiling

2018-11-15 Thread Thomas Weise
Hi Robert,

This is great. It should be added to our Python documentation because users
will likely need this!

After I installed gprof2dot I'm still prompted to install:

"Please install gprof2dot and dot for profile renderings."

Also is there a way to run a pipeline unmodified with fn_api_runner? (For
those interested in profiling the SDK worker.)

It works with the direct runner, but "FnApiRunner" isn't currently supported as
a --runner argument:

python -m apache_beam.examples.wordcount \
  --input=/etc/profile \
  --output=/tmp/py-wordcount-direct \
  *--runner=FnApiRunner* \
  --streaming \
  --profile_cpu --profile_location=./build/pyprofile

Thanks,
Thomas


On Mon, Nov 5, 2018 at 7:15 PM Ankur Goenka  wrote:

> All containers are destroyed by default on termination so to analyze
> profiling data for portable runners, either disable container cleanup
> (using --retainDockerContainers=true) or use remote distributed file
> system path.
>
> On Mon, Nov 5, 2018 at 1:05 AM Robert Bradshaw 
> wrote:
>
>> Any portable runner should pick it up automatically.
>> On Tue, Oct 30, 2018 at 3:32 AM Manu Zhang 
>> wrote:
>> >
>> > Cool! Can we document it somewhere such that other Runners could pick
>> it up later?
>> >
>> > Thanks,
>> > Manu Zhang
>> > On Oct 29, 2018, 5:54 PM +0800, Maximilian Michels ,
>> wrote:
>> >
>> > This looks very helpful for debugging performance of portable pipelines.
>> > Great work!
>> >
>> > Enabling local directories for Flink or other portable Runners would be
>> > useful for debugging, e.g. per
>> > https://issues.apache.org/jira/browse/BEAM-5440
>> >
>> > On 26.10.18 18:08, Robert Bradshaw wrote:
>> >
>> > Now that we've (mostly) moved from features to performance for
>> > BeamPython-on-Flink, I've been doing some profiling of Python code,
>> > and thought it may be useful for others as well (both those working on
>> > the SDK, and users who want to understand their own code), so I've
>> > tried to wrap this up into something useful.
>> >
>> > Python already had some existing profile options that we used with
>> > Dataflow, specifically --profile_cpu and --profile_location. I've
>> > hooked these up to both the DirectRunner and the SDK Harness Worker.
>> > One can now run commands like
>> >
>> > python -m apache_beam.examples.wordcount
>> > --output=counts.txt --profile_cpu --profile_location=path/to/directory
>> >
>> > and get nice graphs like the one attached. (Here the bulk of the time
>> > is spent reading from the default input in gcs. Another hint for
>> > reading the graph is that due to fusion the call graph is cyclic,
>> > passing through operations:86:receive for every output.)
>> >
>> > The raw python profile stats [1] are produced in that directory, along
>> > with a dot graph and an svg if both dot and gprof2dot are installed.
>> > There is also an important option --direct_runner_bundle_repeat which
>> > can be set to gain more accurate profiles on smaller data sets by
>> > re-playing the bundle without the (non-trivial) one-time setup costs.
>> >
>> > These flags also work on portability runners such as Flink, where the
>> > directory must be set to a distributed filesystem. Each bundle
>> > produces its own profile in that directory, and they can be
>> > concatenated and manually fed into tools like below. In that case
>> > there is a --profile_sample_rate which can be set to avoid profiling
>> > every single bundle (e.g. for a production job).
>> >
>> > The PR is up at https://github.com/apache/beam/pull/6847 Hope it's
>> useful.
>> >
>> > - Robert
>> >
>> >
>> > [1] https://docs.python.org/2/library/profile.html
>> > [2] https://github.com/jrfonseca/gprof2dot
>> >
>>
>


Wiki edit access

2018-11-15 Thread Wout Scheepers
Can anyone give me edit access for the wiki?

Thanks,
Wout


Re: How to use "PortableRunner" in Python SDK?

2018-11-15 Thread Maximilian Michels

> 1) The default behavior, where PortableRunner starts a flink server. It is
> confusing to new users.

It does that only if no JobServer endpoint is specified. AFAIK there are
problems with the bootstrapping; it can definitely be improved.

> 2) All the related docs and inline comments. Similarly, it could be very
> confusing connecting PortableRunner to the Flink server.

+1 We definitely need to improve docs and usability.

> 3) [Probably no longer an issue.] I couldn't make the flink server example
> work. And I could not make the example work on the Java ULR either.

AFAIK the Java ULR hasn't received love for a long time.

-Max

On 14.11.18 20:57, Ruoyun Huang wrote:

To answer Maximilian's question.

I am using Linux, a Debian distribution.

It probably sounded too much when I used the word 'planned merge'. What 
I really meant entails less change than it sounds. More specifically:


1) The default behavior, where PortableRunner starts a flink server.  It 
is confusing to new users.
2) All the related docs and inline comments.  Similarly, it could be 
very confusing connecting PortableRunner to Flink server.
3) [Probably no longer an issue.] I couldn't make the flink server
example work. And I could not make the example work on the Java ULR
either. Both will require debugging to resolve. Thus I figured
maybe let us only focus on one single thing: the Java-ULR part, without
worrying about the Flink server. Again, it looks like this may not be a valid
concern, given the Flink part is most likely due to my setup.



On Wed, Nov 14, 2018 at 3:30 AM Maximilian Michels wrote:


Hi Ruoyun,

I just ran the wordcount locally using the instructions on the page.
I've tried the local file system and GCS. Both times it ran
successfully
and produced valid output.

I'm assuming there is some problem with your setup. Which platform are
you using? I'm on MacOS.

Could you expand on the planned merge? From my understanding we will
always need PortableRunner in Python to be able to submit against the
Beam JobServer.

Thanks,
Max

On 14.11.18 00:39, Ruoyun Huang wrote:
 > A quick follow-up on using current PortableRunner.
 >
 > I followed the exact three steps as Ankur and Maximilian shared in
 > https://beam.apache.org/roadmap/portability/#python-on-flink  ; 
  The

 > wordcount example keeps hanging after 10 minutes.  I also tried
 > specifying explicit input/output args, either using gcs folder or
local
 > file system, but none of them works.
 >
 > Spent some time looking into it but conclusion yet.  At this point
 > though, I guess it does not matter much any more, given we
already have
 > the plan of merging PortableRunner into using java reference runner
 > (i.e. :beam-runners-reference-job-server).
 >
 > Still appreciated if someone can try out the python-on-flink
 >
instructions

 > in case it is just due to my local machine setup.  Thanks!
 >
 >
 >
 > On Thu, Nov 8, 2018 at 5:04 PM Ruoyun Huang mailto:ruo...@google.com>
 > >> wrote:
 >
 >     Thanks Maximilian!
 >
 >     I am working on migrating existing PortableRunner to using
java ULR
 >     (Link to Notes
 >   
  ).

 >     If this issue is non-trivial to solve, I would vote for removing
 >     this default behavior as part of the consolidation.
 >
 >     On Thu, Nov 8, 2018 at 2:58 AM Maximilian Michels
mailto:m...@apache.org>
 >     >> wrote:
 >
 >         In the long run, we should get rid of the
Docker-inside-Docker
 >         approach,
 >         which was only intended for testing anyways. It would be
cleaner to
 >         start the SDK harness container alongside with JobServer
container.
 >
 >         Short term, I think it should be easy to either fix the
 >         permissions of
 >         the mounted "docker" executable or use a Docker image for the
 >         JobServer
 >         which comes with Docker pre-installed.
 >
 >         JIRA: https://issues.apache.org/jira/browse/BEAM-6020
 >
 >         Thanks for reporting this Ruoyun!
 >
 >         -Max
 >
 >         On 08.11.18 00:10, Ruoyun Huang wrote:
 >          > Thanks Ankur and Maximilian.
 >          >
 >          > Just for reference in case other people encountering
the same
 >         error
 >          > message, the "permission denied" error in my original
email
 >         is exactly
 >          > due to dockerinsidedocker issue that Ankur mentioned.
 >         Thanks Ankur!
 >          > 

Re: Bigquery streaming TableRow size limit

2018-11-15 Thread Wout Scheepers
Thanks for your thoughts.

Also, I’m doing something similar when streaming data into partitioned tables.
From [1]:
“ When the data is streamed, data between 7 days in the past and 3 days in the 
future is placed in the streaming buffer, and then it is extracted to the 
corresponding partitions.”

I added a check to see if the event time is within this timebound. If not, a 
load job is triggered. This can happen when we replay old data.
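
A minimal sketch of that check (illustrative only; the 7-day/3-day bounds come
from the quoted documentation, and routing the out-of-bounds rows to a load job
is left to the pipeline):

// Illustrative check: only stream rows whose event time falls inside BigQuery's
// streaming-buffer window (7 days in the past to 3 days in the future); anything
// outside that bound would be routed to a batch load job instead.
import org.joda.time.Duration;
import org.joda.time.Instant;

public class StreamingBufferBounds {

  static boolean withinStreamingBounds(Instant eventTime, Instant now) {
    Instant earliest = now.minus(Duration.standardDays(7));
    Instant latest = now.plus(Duration.standardDays(3));
    return !eventTime.isBefore(earliest) && !eventTime.isAfter(latest);
  }

  public static void main(String[] args) {
    Instant now = Instant.now();
    System.out.println(withinStreamingBounds(now.minus(Duration.standardDays(10)), now)); // false -> load job
    System.out.println(withinStreamingBounds(now, now)); // true -> streaming insert
  }
}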

Do you also think this would be worth adding to BigQueryIO?
If so, I’ll try to create a PR for both features.

Thanks,
Wout

[1] : 
https://cloud.google.com/bigquery/streaming-data-into-bigquery#streaming_into_partitioned_tables


From: Reuven Lax 
Reply-To: "dev@beam.apache.org" 
Date: Wednesday, 14 November 2018 at 14:51
To: "dev@beam.apache.org" 
Subject: Re: Bigquery streaming TableRow size limit

Generally I would agree, but the consequences here of a mistake are severe. Not 
only will the beam pipeline get stuck for 24 hours, _anything_ else in the 
user's GCP project that tries to load data into BigQuery will also fail for the 
next 24 hours. Given the severity, I think it's best to make the user opt into 
this behavior rather than do it magically.

On Wed, Nov 14, 2018 at 4:24 AM Lukasz Cwik <lc...@google.com> wrote:
I would rather not have the builder method and run into the quota issue than 
require the builder method and still run into quota issues.

On Mon, Nov 12, 2018 at 5:25 PM Reuven Lax <re...@google.com> wrote:
I'm a bit worried about making this automatic, as it can have unexpected side 
effects on BigQuery load-job quota. This is a 24-hour quota, so if it's 
accidentally exceeded all load jobs for the project may be blocked for the next 
24 hours. However, if the user opts in (possibly via a builder method), this 
seems like it could be automatic.

Reuven

On Tue, Nov 13, 2018 at 7:06 AM Lukasz Cwik <lc...@google.com> wrote:
Having data ingestion work without needing to worry about how big the blobs are 
would be nice if it was automatic for users.

On Mon, Nov 12, 2018 at 1:03 AM Wout Scheepers <wout.scheep...@vente-exclusive.com> wrote:
Hey all,

The TableRow size limit is 1 MB when streaming into BigQuery.
To prevent data loss, I’m going to implement a TableRow size check and add a 
fan out to do a bigquery load job in case the size is above the limit.
Of course this load job would be windowed.
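
A rough sketch of the routing I have in mind (illustrative only; the row size is
approximated by its JSON representation, the threshold leaves headroom under the
1 MB limit, and the actual streaming-insert and load-job writes are omitted):

// Illustrative routing DoFn: rows whose (approximate) serialized size exceeds a
// threshold go to a side output destined for a windowed load job; everything
// else stays on the main output for streaming inserts.
import com.google.api.services.bigquery.model.TableRow;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.TupleTag;

public class SizeRoutingFn extends DoFn<TableRow, TableRow> {

  public static final TupleTag<TableRow> STREAMABLE = new TupleTag<TableRow>() {};
  public static final TupleTag<TableRow> OVERSIZED = new TupleTag<TableRow>() {};

  // Keep some headroom below BigQuery's 1 MB streaming row limit.
  private static final long MAX_STREAMING_BYTES = 900 * 1024;

  @ProcessElement
  public void processElement(ProcessContext c) {
    // Rough size estimate via the row's JSON representation.
    long approxBytes = c.element().toString().getBytes(StandardCharsets.UTF_8).length;
    if (approxBytes > MAX_STREAMING_BYTES) {
      c.output(OVERSIZED, c.element());
    } else {
      c.output(c.element());
    }
  }
}

The two outputs could then be obtained via
ParDo.of(new SizeRoutingFn()).withOutputTags(SizeRoutingFn.STREAMABLE,
TupleTagList.of(SizeRoutingFn.OVERSIZED)) and written with the streaming and
file-load flavours of BigQueryIO respectively.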

I know it doesn't make sense to stream data bigger than 1 MB, but as we're using 
Pub/Sub and want to make sure no data loss happens whatsoever, I'll need to 
implement it.

Is this functionality something any of you would like to see in BigQueryIO itself?
Or do you think my use case is too specific, and implementing my solution around 
BigQueryIO will suffice?

Thanks for your thoughts,
Wout