Re: [DISCUSS] Backwards compatibility of @Experimental features

2019-08-12 Thread Anton Kedin
Concrete user feedback:
https://stackoverflow.com/questions/57453473/was-the-beamrecord-type-removed-from-apache-beam/57463708#57463708
Short version: we moved BeamRecord from Beam SQL to core Beam and renamed
it to Row (still @Experimental, BTW), but we never mentioned it anywhere
users could easily find it. Highlighting deprecations and major shifts of
public APIs in the release blog post (and in the Javadoc) would at least
make this traceable.

Regards,
Anton

On Wed, May 8, 2019 at 1:42 PM Kenneth Knowles  wrote:

>
>
> On Wed, May 8, 2019 at 9:29 AM Ahmet Altay  wrote:
>
>>
>>
>> *From: *Kenneth Knowles 
>> *Date: *Wed, May 8, 2019 at 9:24 AM
>> *To: *dev
>>
>>
>>>
>>> On Fri, Apr 19, 2019 at 3:09 AM Ismaël Mejía  wrote:
>>>
 It seems we mostly agree that @Experimental is important, and that API
 changes (removals) on experimental features should happen quickly but still
 give some time to users so the Experimental purpose is not lost.

 Ahmet's proposal, given our current release calendar, is close to 2
 releases. Can we settle on 2 releases as a 'minimum time' before removal?
 (This gives maintainers the option to support it longer if they want, as
 discussed in the related KafkaIO thread, while still being friendly to
 users.)

 Do we agree?

>>>
>>> This sounds pretty good to me.
>>>
>>
>> Sounds good to me too.
>>
>>
>>> How can we manage this? Right now we tie most activities (like
>>> re-triaging flakes) to the release process, since it is the only thing that
>>> happens regularly for the community. If we don't have some forcing then I
>>> expect the whole thing will just be forgotten.
>>>
>>
>> Can we pre-create a list of future releases in JIRA, and for each
>> experimental feature require that a JIRA issue is created for resolving the
>> experimental status and tag it with the release that will happen after the
>> minimum time period?
>>
>
> Great idea. I just created the 2.15.0 release so it reaches far enough
> ahead for right now.
>
> Kenn
>
>
>>
>>> Kenn
>>>
>>>

 Note: for the other subjects (e.g. when an Experimental feature should
 stop being experimental) I think we will hardly find agreement, so this
 should be treated on a case-by-case basis by the maintainers; but if you
 want to follow up on that discussion we can open another thread for it.



 On Sat, Apr 6, 2019 at 1:04 AM Ahmet Altay  wrote:

> I agree that Experimental feature is still very useful. I was trying
> to argue that we diluted its value so +1 to reclaim that.
>
> Back to the original question, in my opinion removing existing
> "experimental and deprecated" features in n=1 release will confuse users.
> This will likely be a surprise to them because we have been maintaining
> this state release after release now. I would propose warning users in the
> next release that such a change is coming, and giving them at least 3 months
> to upgrade to the suggested newer paths. In the future we can have shorter
> timelines, assuming that we set the user expectations right.
>
> On Fri, Apr 5, 2019 at 3:01 PM Ismaël Mejía  wrote:
>
>> I agree 100% with Kenneth on the multiple advantages that the
>> Experimental feature gave us. I also can count multiple places where this
>> has been essential in modules other than core. I disagree that the
>> @Experimental annotation has lost its meaning; it is simply ill-defined,
>> and that is probably by design, because its advantages come from it.
>>
>> Most of the topics in this thread are a consequence of this loose
>> definition, e.g. (1) not defining how a feature becomes stable, and (2)
>> what to do when we want to remove an experimental feature; these are things
>> we need to decide whether to define or just continue handling as we do today.
>>
>> Defining a target for graduating an Experimental feature is a bit too
>> aggressive, with not much benefit; in this case we could be losing the
>> advantages of Experimental (unless we could change the proposed version in
>> the future). This probably makes sense for the removal of features but
>> makes less sense for deciding when a feature becomes stable. Of course in
>> the case of the core SDK packages this is probably more critical, but
>> nothing guarantees that things will be ready when we expect them to. When
>> will we tag things like SDF or the portability APIs as stable? We cannot
>> predict the future for completion of features.
>>
>> Nobody has mentioned the LTS releases; couldn't these be the middle
>> points for these decisions? That would at least give LTS some value,
>> because so far I still have trouble seeing the value of that idea given
>> that we can do a minor release of any previously released version.

Re: [Update] Beam 2.15 Release Progress

2019-08-07 Thread Anton Kedin
The perf regression is seemingly gone now. If the fix came from a specific PR,
we might want to find out which one and cherry-pick it into the release.

Regards,
Anton

On Tue, Aug 6, 2019 at 4:52 PM Yifan Zou  wrote:

> Hi,
>
> There is a perf regression on SQL Query3 on the Dataflow runner. This was
> treated as a release blocker. We would appreciate it if someone could look
> into this issue.
>
> For more details, please see Anton's email [1] and JIRA [2].
> [1]
> https://lists.apache.org/thread.html/5441431cb2cf8fb445a2e30e6b2a8feb199d189755cf12b0c86fb1c8@%3Cdev.beam.apache.org%3E
> [2] https://issues.apache.org/jira/browse/BEAM-7906
>
> Regards.
> Yifan
>
>
> On Mon, Aug 5, 2019 at 10:35 AM Yifan Zou  wrote:
>
>> Hi,
>>
>> I've verified the release branch, and all pre/post-commits passed. The next
>> step would be verifying the Javadoc.
>> We still have a few blocking issues:
>> https://issues.apache.org/jira/browse/BEAM-7880?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20%22Triage%20Needed%22)%20AND%20fixVersion%20%3D%202.15.0
>> Please ping me once the tickets are fixed, or move them to the next
>> version to unblock the release. Thanks.
>>
>> Yifan
>>
>> On Wed, Jul 31, 2019 at 4:33 PM Yifan Zou  wrote:
>>
>>> Snapshots are published
>>> http://repository.apache.org/content/groups/snapshots/org/apache/beam/.
>>>
>>> On Wed, Jul 31, 2019 at 1:28 PM Yifan Zou  wrote:
>>>
 Hi,

 The release branch is cut:
 https://github.com/apache/beam/tree/release-2.15.0.
 The next step would be building snapshots and verifying the release branch.

 Regards.
 Yifan

>>>


Re: [ANNOUNCE] New committer: Kyle Weaver

2019-08-06 Thread Anton Kedin
Congrats!

On Tue, Aug 6, 2019, 9:37 AM Ankur Goenka  wrote:

> Congratulations Kyle!
>
> On Tue, Aug 6, 2019 at 9:35 AM Ahmet Altay  wrote:
>
>> Hi,
>>
>> Please join me and the rest of the Beam PMC in welcoming a new committer: 
>> Kyle
>> Weaver.
>>
>> Kyle has been contributing to Beam for a while now. And in that time
>> period Kyle got the portable spark runner feature complete for batch
>> processing. [1]
>>
>> In consideration of Kyle's contributions, the Beam PMC trusts him with
>> the responsibilities of a Beam committer [2].
>>
>> Thank you, Kyle, for your contributions and looking forward to many more!
>>
>> Ahmet, on behalf of the Apache Beam PMC
>>
>> [1]
>> https://lists.apache.org/thread.html/c43678fc24c9a1dc9f48c51c51950aedcb9bc0fd3b633df16c3d595a@%3Cuser.beam.apache.org%3E
>> [2] https://beam.apache.org/contribute/become-a-committer
>> /#an-apache-beam-committer
>>
>


Re: [ANNOUNCE] New committer: Rui Wang

2019-08-06 Thread Anton Kedin
Congrats!

On Tue, Aug 6, 2019, 9:36 AM Ankur Goenka  wrote:

> Congratulations Rui!
> Well deserved 
>
> On Tue, Aug 6, 2019 at 9:35 AM Ahmet Altay  wrote:
>
>> Hi,
>>
>> Please join me and the rest of the Beam PMC in welcoming a new committer: Rui
>> Wang.
>>
>> Rui has been an active contributor since May 2018. Rui has been very
>> active in Beam SQL [1] and continues to help out on user@ and
>> StackOverflow. Rui is one of the top answerers for apache-beam tag [2].
>>
>> In consideration of Rui's contributions, the Beam PMC trusts him with the
>> responsibilities of a Beam committer [3].
>>
>> Thank you, Rui, for your contributions and looking forward to many more!
>>
>> Ahmet, on behalf of the Apache Beam PMC
>>
>> [1] https://github.com/apache/beam/pulls?q=is%3Apr+author%3Aamaliujia
>> [2] https://stackoverflow.com/tags/apache-beam/topusers
>> [3] https://beam.apache.org/contribute/become-a-committer
>> /#an-apache-beam-committer
>>
>


Perf regression

2019-08-06 Thread Anton Kedin
I noticed a perf regression that appeared on the Nexmark dashboard on
July 30. It seems to be limited to SQL Query3 and is most obvious in the
Dataflow runner. The Direct runner shows a slight increase as well, but the
Spark runner doesn't seem to be affected. I looked at the history of changes
in Beam SQL and don't see an obvious culprit around that date.

This happened just before the 2.15.0 release branch was cut (July 31) so I
marked it as a blocker for now. More details in the Jira:
https://issues.apache.org/jira/browse/BEAM-7906

If anyone has any ideas or suspects, please comment on the jira.

Regards,
Anton


[RESULT] [VOTE] Release 2.14.0, release candidate #1

2019-07-31 Thread Anton Kedin
I'm happy to announce that we have unanimously approved this release.

There are 7 approving votes, 4 of which are binding (in order):
* Ahmet (al...@google.com);
* Robert (rober...@google.com);
* Pablo (pabl...@google.com);
* Ismaël (ieme...@gmail.com);

There are no disapproving votes.

Thanks everyone!

The next step is to finalize the release (merge the docs/website/blog PRs,
publish artifacts); I will be working on it tomorrow (PST, Seattle time).

Please let me know if you have any questions, concerns or if I made a
mistake somewhere.

Regards,
Anton


Re: [VOTE] Release 2.14.0, release candidate #1

2019-07-30 Thread Anton Kedin
Ran various postcommits, ValidatesRunner tests, and Nexmark against the
release branch. All looks good so far.

Will take another look at the docs/blog and the Nexmark numbers tomorrow,
but if nothing comes up I will close the vote tomorrow (Wednesday) by 6pm
PST (= Thursday 01:00am UTC), since it will be over 72 hours since the vote
started and we have a number of +1s, including PMC members, and no -1s.

Regards,
Anton

On Tue, Jul 30, 2019 at 8:13 PM Valentyn Tymofieiev 
wrote:

> I also ran unit tests for Python 3.7 and they passed as well. Cython tests
> for python3.7 require  `apt-get install python3.7-dev`.
>
> On Wed, Jul 31, 2019 at 3:16 AM Pablo Estrada  wrote:
>
>> +1
>>
>> I installed from source, and ran unit tests for Python in 2.7, 3.5, 3.6.
>>
>> Also ran a number of integration tests on Py 3.5 on Dataflow and
>> DirectRunner.
>> Best
>> -P.
>>
>> On Tue, Jul 30, 2019 at 11:09 AM Hannah Jiang 
>> wrote:
>>
>>> I checked Py3 tests using .zip, mainly with direct runners, and
>>> everything looks good, so +1.
>>>
>>> On Tue, Jul 30, 2019 at 2:08 AM Robert Bradshaw 
>>> wrote:
>>>
>>>> I checked all the artifact signatures and ran a couple test pipelines
>>>> with the wheels (Py2 and Py3) and everything looked good to me, so +1.
>>>>
>>>> On Mon, Jul 29, 2019 at 8:29 PM Valentyn Tymofieiev <
>>>> valen...@google.com> wrote:
>>>>
>>>>> I have checked Python 3 batch and streaming quickstarts on Dataflow
>>>>> runner using .zip and wheel distributions. So far +1 from me.
>>>>>
>>>>> On Mon, Jul 29, 2019 at 7:53 PM Ahmet Altay  wrote:
>>>>>
>>>>>> +1, validated python 2 quickstarts.
>>>>>>
>>>>>> On Fri, Jul 26, 2019 at 5:46 PM Ahmet Altay  wrote:
>>>>>>
>>>>>>> To confirm, I manually validated the leaderboard on Python. It is
>>>>>>> working.
>>>>>>>
>>>>>>> On Fri, Jul 26, 2019 at 5:23 PM Yifan Zou 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> AFAIK, there should not be any special prerequisites for this.
>>>>>>>> Things the script does include:
>>>>>>>> 1. download the python rc in zip
>>>>>>>> 2. start virtualenv and install the sdk.
>>>>>>>> 3. verify hash.
>>>>>>>> 4. config settings.xml and start a Java pubsub message injector.
>>>>>>>> 5. run game examples and validate.
>>>>>>>>
>>>>>>>> Could you double-check that the SDK was installed properly (steps 1 & 2)?
>>>>>>>>
>>>>>>>
>>>>>>> I am also guessing this is the case. Probably something earlier in the
>>>>>>> validation script did not run as expected.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> Yifan
>>>>>>>>
>>>>>>>> On Fri, Jul 26, 2019 at 2:38 PM Anton Kedin 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Validation script fails for me when I try to run [1] python
>>>>>>>>> leaderboard with direct runner:
>>>>>>>>>
>>>>>>>>> ```
>>>>>>>>> *
>>>>>>>>> * Running Python Leaderboard with DirectRunner
>>>>>>>>> *
>>>>>>>>> /usr/bin/python: No module named apache_beam.examples.complete.game
>>>>>>>>> ```
>>>>>>>>>
>>>>>>>>> If someone has more context, what are the prerequisites for this
>>>>>>>>> step? How does it look up the module?
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://github.com/apache/beam/blob/master/release/src/main/scripts/run_rc_validation.sh#L424
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Anton
>>>>>>>>>
>>>>>>>>> On Fri, Jul 26, 2019 at 10:23 AM Anton Kedin 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Cool, will make the post and will update the release guide as
>>>>>>>>>> well then
>>>>>>>>>>
>>>>>>>>>> On Fri, Jul 26, 2019 at 10:20 AM Chad Dombrova 
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I think the release guide needs to be updated to remove the
>>>>>>>>>>>> optionality of blog creation and avoid confusion. Thanks for 
>>>>>>>>>>>> pointing that
>>>>>>>>>>>> out.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> +1
>>>>>>>>>>>
>>>>>>>>>>>


Re: [VOTE] Release 2.14.0, release candidate #1

2019-07-26 Thread Anton Kedin
Cool, will make the post and will update the release guide as well then

On Fri, Jul 26, 2019 at 10:20 AM Chad Dombrova  wrote:

> I think the release guide needs to be updated to remove the optionality of
>> blog creation and avoid confusion. Thanks for pointing that out.
>>
>
> +1
>
>


Re: [VOTE] Release 2.14.0, release candidate #1

2019-07-26 Thread Anton Kedin
Hi Thomas, I haven't made one. I read that step of the guide as optional ("...if
needed for this particular release..."). I am not sure if anything specific
needs to be announced or highlighted for 2.14. I can go over the closed
Jiras and create a blog post if that's expected.

Regards,
Anton

On Fri, Jul 26, 2019 at 9:38 AM Thomas Weise  wrote:

> Hi Anton,
>
> Thanks for working on the release.
>
> I don't find the release blog in https://github.com/apache/beam/pull/9157 or
> elsewhere?
>
> This should be part of the release candidate [1] and I wonder why we keep
> missing it in RCs. Is there something that needs to be fixed in [1]?
>
> The reason I now check for this as one of the first items is that we have
> traditionally done a poor job of communicating releases to users, even though
> this is actually very important. The blog needs many eyes to make sure we capture
> what matters in a way that makes sense to users.
>
> Thomas
>
>
>
>
>
>
> [1]
> https://beam.apache.org/contribute/release-guide/#write-the-beam-blog-post-and-create-a-pull-request
>
>
>
> On Thu, Jul 25, 2019 at 4:25 PM Rui Wang  wrote:
>
>> Tried to verify RC1 by running Nexmark on Dataflow but found it's broken
>> (at least based on the commands from Running+Nexmark
>> <https://cwiki.apache.org/confluence/display/BEAM/Running+Nexmark>).
>> Will try to debug it and rerun the process.
>>
>>
>> -Rui
>>
>> On Thu, Jul 25, 2019 at 2:39 PM Anton Kedin  wrote:
>>
>>> Hi everyone,
>>> Please review and vote on release candidate #1 for version 2.14.0, as
>>> follows:
>>> [ ] +1, Approve the release
>>> [ ] -1, Do not approve the release (please provide specific comments)
>>>
>>> The complete staging area is available for your review, which includes:
>>> * JIRA release notes [1],
>>> * the official Apache source release to be deployed to dist.apache.org
>>> [2], which is signed with the key with fingerprint
>>> 89E2FFCAE7E99CF6E6827CFEF7349F2310FFB193 [3],
>>> * all artifacts to be deployed to the Maven Central Repository [4],
>>> * source code tag "v2.14.0-RC1" [5], [6]
>>> * website pull request listing the release [7], publishing the API
>>> reference manual [8].
>>> * Python artifacts are deployed along with the source release to the
>>> dist.apache.org [2].
>>> * Validation sheet with a tab for 2.14.0 release to help with validation
>>> [9].
>>>
>>> The vote will be open for at least 72 hours. It is adopted by majority
>>> approval, with at least 3 PMC affirmative votes.
>>>
>>> Thanks,
>>> Anton
>>>
>>> [1]
>>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12345431
>>> [2] https://dist.apache.org/repos/dist/dev/beam/2.14.0/
>>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>> [4]
>>> https://repository.apache.org/content/repositories/orgapachebeam-1080/
>>> [5] https://github.com/apache/beam/tree/v2.14.0-RC1
>>> [6] https://github.com/apache/beam/tags
>>> [7] https://github.com/apache/beam/pull/9157
>>> [8] https://github.com/apache/beam-site/pull/591/
>>> [9] https://s.apache.org/beam-release-validation#gid=1082148452
>>>
>>


[VOTE] Release 2.14.0, release candidate #1

2019-07-25 Thread Anton Kedin
Hi everyone,
Please review and vote on release candidate #1 for version 2.14.0,
as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release to be deployed to dist.apache.org [2],
which is signed with the key with fingerprint
89E2FFCAE7E99CF6E6827CFEF7349F2310FFB193 [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "v2.14.0-RC1" [5], [6]
* website pull request listing the release [7], publishing the API
reference manual [8].
* Python artifacts are deployed along with the source release to the
dist.apache.org [2].
* Validation sheet with a tab for 2.14.0 release to help with validation
[9].

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.

Thanks,
Anton

[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12345431
[2] https://dist.apache.org/repos/dist/dev/beam/2.14.0/
[3] https://dist.apache.org/repos/dist/release/beam/KEYS
[4] https://repository.apache.org/content/repositories/orgapachebeam-1080/
[5] https://github.com/apache/beam/tree/v2.14.0-RC1
[6] https://github.com/apache/beam/tags
[7] https://github.com/apache/beam/pull/9157
[8] https://github.com/apache/beam-site/pull/591/
[9] https://s.apache.org/beam-release-validation#gid=1082148452


Re: [2.14.0] Release Progress Update

2019-07-25 Thread Anton Kedin
Planning to send out the RC1 within the next couple of hours.

Regards,
Anton

On Thu, Jul 25, 2019 at 1:21 PM Pablo Estrada  wrote:

> Hi Anton,
> are there updates on the release?
> Thanks!
> -P.
>
> On Fri, Jul 19, 2019 at 12:33 PM Anton Kedin  wrote:
>
>> Verification build succeeds except for AWS IO (which has tests hanging).
>> I will continue the release process as normal and will investigate the AWS
>> IO issue meanwhile. Will either disable the hanging tests to get the
>> artifacts for an RC or will continue without it temporarily, will need to
>> re-validate it when the issue is resolved.
>>
>> Regards,
>> Anton
>>
>> On Thu, Jul 18, 2019 at 8:54 AM Anton Kedin  wrote:
>>
>>> All cherry-picks are merged, blocker jiras closed, running the
>>> verification build.
>>>
>>> On Mon, Jul 15, 2019 at 4:53 PM Ahmet Altay  wrote:
>>>
>>>> Anton, any updates on this release? Do you need help?
>>>>
>>>> On Fri, Jun 28, 2019 at 11:42 AM Anton Kedin  wrote:
>>>>
>>>>> I have been running validation builds (had some hickups with that),
>>>>> everything looks mostly good, except failures in `:beam-test-tools` and
>>>>> `:io:aws`. Now I will start cherry-picking other fixes and trying to 
>>>>> figure
>>>>> the specific issues out.
>>>>>
>>>>> Regards,
>>>>> Anton
>>>>>
>>>>> On Fri, Jun 21, 2019 at 3:17 PM Anton Kedin  wrote:
>>>>>
>>>>>> Not much progress today. Debugging build issues when running global
>>>>>> `./gradlew build -PisRelease --scan`
>>>>>>
>>>>>> Regards,
>>>>>> Anton
>>>>>>
>>>>>> On Thu, Jun 20, 2019 at 4:12 PM Anton Kedin  wrote:
>>>>>>
>>>>>>> Published the snapshots, working through the
>>>>>>> verify_release_validation script
>>>>>>>
>>>>>>> Got another blocker to be cherry-picked when merged:
>>>>>>> https://issues.apache.org/jira/browse/BEAM-7603
>>>>>>>
>>>>>>> Regards,
>>>>>>> Anton
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jun 19, 2019 at 4:17 PM Anton Kedin 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I have cut the release branch for 2.14.0 and working through the
>>>>>>>> release process. Next step is building the snapshot and release branch
>>>>>>>> verification.
>>>>>>>>
>>>>>>>> There are two issues [1] that are still not resolved that are
>>>>>>>> marked as blockers at the moment:
>>>>>>>>  * [2] BEAM-7478 - remote cluster submission from Flink Runner
>>>>>>>> broken;
>>>>>>>>  * [3] BEAM-7424 - retries for GCS;
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://issues.apache.org/jira/browse/BEAM-7478?jql=project%20%3D%20BEAM%20%20AND%20fixVersion%20%3D%202.14.0%20AND%20status%20!%3D%20Closed%20AND%20status%20!%3DResolved
>>>>>>>> [2] https://issues.apache.org/jira/browse/BEAM-7478
>>>>>>>> [3] https://issues.apache.org/jira/browse/BEAM-7424
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Anton
>>>>>>>>
>>>>>>>


Re: How to run DynamoDBIOTest?

2019-07-19 Thread Anton Kedin
One of the machines is macOS 10.14.5, Docker Desktop 2.0.0.3 (engine
18.09.2), Java 1.8.0_211 (the Oracle version, I believe). Test log:
https://gist.github.com/akedin/da6fbc8a993f758302a6f64c42bba11b#file-gistfile1-txt
It then spins forever with only Gradle logs. Another machine I tried it on is
something Debian-based, OpenJDK 8u212, Docker 18.09.3; the logs are similar
(I don't have access to the details at the moment).

Regards,
Anton

On Fri, Jul 19, 2019 at 2:01 PM Ismaël Mejía  wrote:

> This looks weird. I ran the build on my machine (Ubuntu Linux 18.04 +
> OpenJDK 1.8.0_201) + Docker 18.09.8 on both master and the release-2.14.0
> branch and it executes without issue.
> This test uses a docker image as a sort of embedded server to simulate
> the Amazon backend (localstack).
> All builds were green when merged. Do you get any extra logs, Anton?
> What is your OS / Java version?
> Adding Cam to the discussion since he contributed this feature to see
> if he may have any extra context.
>
> On Fri, Jul 19, 2019 at 7:15 PM Anton Kedin  wrote:
> >
> > Hi dev@,
> >
> > Does anyone know if there's anything extra needed to run
> `DynamoDBIOTest`? If I do `./gradlew
> :sdks:java:io:amazon-web-services:build --debug` it passes a few tests during
> `:test` but then seems to sit on `DynamoDBIOTest` forever. No errors, last
> meaningful log is `INFO: Container localstack/localstack:0.8.6 started`.
> Happens on different machines, both on master and release-2.14.0 branches.
> >
> > Any pointers?
> >
> > Regards,
> > Anton
>


Re: [2.14.0] Release Progress Update

2019-07-19 Thread Anton Kedin
The verification build succeeds except for AWS IO (which has hanging tests).
I will continue the release process as normal and investigate the AWS IO
issue in the meantime. I will either disable the hanging tests to get the
artifacts for an RC, or continue without them temporarily and re-validate
once the issue is resolved.

Regards,
Anton

On Thu, Jul 18, 2019 at 8:54 AM Anton Kedin  wrote:

> All cherry-picks are merged, blocker jiras closed, running the
> verification build.
>
> On Mon, Jul 15, 2019 at 4:53 PM Ahmet Altay  wrote:
>
>> Anton, any updates on this release? Do you need help?
>>
>> On Fri, Jun 28, 2019 at 11:42 AM Anton Kedin  wrote:
>>
>>> I have been running validation builds (had some hickups with that),
>>> everything looks mostly good, except failures in `:beam-test-tools` and
>>> `:io:aws`. Now I will start cherry-picking other fixes and trying to figure
>>> the specific issues out.
>>>
>>> Regards,
>>> Anton
>>>
>>> On Fri, Jun 21, 2019 at 3:17 PM Anton Kedin  wrote:
>>>
>>>> Not much progress today. Debugging build issues when running global
>>>> `./gradlew build -PisRelease --scan`
>>>>
>>>> Regards,
>>>> Anton
>>>>
>>>> On Thu, Jun 20, 2019 at 4:12 PM Anton Kedin  wrote:
>>>>
>>>>> Published the snapshots, working through the verify_release_validation
>>>>> script
>>>>>
>>>>> Got another blocker to be cherry-picked when merged:
>>>>> https://issues.apache.org/jira/browse/BEAM-7603
>>>>>
>>>>> Regards,
>>>>> Anton
>>>>>
>>>>>
>>>>> On Wed, Jun 19, 2019 at 4:17 PM Anton Kedin  wrote:
>>>>>
>>>>>> I have cut the release branch for 2.14.0 and working through the
>>>>>> release process. Next step is building the snapshot and release branch
>>>>>> verification.
>>>>>>
>>>>>> There are two issues [1] that are still not resolved that are marked
>>>>>> as blockers at the moment:
>>>>>>  * [2] BEAM-7478 - remote cluster submission from Flink Runner broken;
>>>>>>  * [3] BEAM-7424 - retries for GCS;
>>>>>>
>>>>>> [1]
>>>>>> https://issues.apache.org/jira/browse/BEAM-7478?jql=project%20%3D%20BEAM%20%20AND%20fixVersion%20%3D%202.14.0%20AND%20status%20!%3D%20Closed%20AND%20status%20!%3DResolved
>>>>>> [2] https://issues.apache.org/jira/browse/BEAM-7478
>>>>>> [3] https://issues.apache.org/jira/browse/BEAM-7424
>>>>>>
>>>>>> Regards,
>>>>>> Anton
>>>>>>
>>>>>


How to run DynamoDBIOTest?

2019-07-19 Thread Anton Kedin
Hi dev@,

Does anyone know if there's anything extra needed to run `DynamoDBIOTest`?
If I do `./gradlew :sdks:java:io:amazon-web-services:build --debug` it
passes a few tests during `:test` but then seems to sit on `DynamoDBIOTest`
forever. No errors; the last meaningful log is `INFO: Container
localstack/localstack:0.8.6 started`. This happens on different machines, on
both the master and release-2.14.0 branches.

Any pointers?

Regards,
Anton


Re: [2.14.0] Release Progress Update

2019-07-18 Thread Anton Kedin
All cherry-picks are merged, blocker jiras closed, running the verification
build.

On Mon, Jul 15, 2019 at 4:53 PM Ahmet Altay  wrote:

> Anton, any updates on this release? Do you need help?
>
> On Fri, Jun 28, 2019 at 11:42 AM Anton Kedin  wrote:
>
>> I have been running validation builds (had some hickups with that),
>> everything looks mostly good, except failures in `:beam-test-tools` and
>> `:io:aws`. Now I will start cherry-picking other fixes and trying to figure
>> the specific issues out.
>>
>> Regards,
>> Anton
>>
>> On Fri, Jun 21, 2019 at 3:17 PM Anton Kedin  wrote:
>>
>>> Not much progress today. Debugging build issues when running global
>>> `./gradlew build -PisRelease --scan`
>>>
>>> Regards,
>>> Anton
>>>
>>> On Thu, Jun 20, 2019 at 4:12 PM Anton Kedin  wrote:
>>>
>>>> Published the snapshots, working through the verify_release_validation
>>>> script
>>>>
>>>> Got another blocker to be cherry-picked when merged:
>>>> https://issues.apache.org/jira/browse/BEAM-7603
>>>>
>>>> Regards,
>>>> Anton
>>>>
>>>>
>>>> On Wed, Jun 19, 2019 at 4:17 PM Anton Kedin  wrote:
>>>>
>>>>> I have cut the release branch for 2.14.0 and working through the
>>>>> release process. Next step is building the snapshot and release branch
>>>>> verification.
>>>>>
>>>>> There are two issues [1] that are still not resolved that are marked
>>>>> as blockers at the moment:
>>>>>  * [2] BEAM-7478 - remote cluster submission from Flink Runner broken;
>>>>>  * [3] BEAM-7424 - retries for GCS;
>>>>>
>>>>> [1]
>>>>> https://issues.apache.org/jira/browse/BEAM-7478?jql=project%20%3D%20BEAM%20%20AND%20fixVersion%20%3D%202.14.0%20AND%20status%20!%3D%20Closed%20AND%20status%20!%3DResolved
>>>>> [2] https://issues.apache.org/jira/browse/BEAM-7478
>>>>> [3] https://issues.apache.org/jira/browse/BEAM-7424
>>>>>
>>>>> Regards,
>>>>> Anton
>>>>>
>>>>


Re: [ANNOUNCE] New committer: Robert Burke

2019-07-16 Thread Anton Kedin
Congrats!

On Tue, Jul 16, 2019 at 10:36 AM Ankur Goenka  wrote:

> Congratulations Robert!
>
> Go GO!
>
> On Tue, Jul 16, 2019 at 10:34 AM Rui Wang  wrote:
>
>> Congrats!
>>
>>
>> -Rui
>>
>> On Tue, Jul 16, 2019 at 10:32 AM Udi Meiri  wrote:
>>
>>> Congrats Robert B.!
>>>
>>> On Tue, Jul 16, 2019 at 10:23 AM Ahmet Altay  wrote:
>>>
 Hi,

 Please join me and the rest of the Beam PMC in welcoming a new
 committer: Robert Burke.

 Robert has been contributing to Beam and actively involved in the
 community for over a year. He has been actively working on Go SDK, helping
 users, and making it easier for others to contribute [1].

 In consideration of Robert's contributions, the Beam PMC trusts him
 with the responsibilities of a Beam committer [2].

 Thank you, Robert, for your contributions and looking forward to many
 more!

 Ahmet, on behalf of the Apache Beam PMC

 [1]
 https://lists.apache.org/thread.html/8f729da2d3009059d7a8b2d8624446be161700dcfa953939dd3530c6@%3Cdev.beam.apache.org%3E
 [2] https://beam.apache.org/contribute/become-a-committer
 /#an-apache-beam-committer

>>>


Re: [2.14.0] Release Progress Update

2019-06-28 Thread Anton Kedin
I have been running validation builds (had some hiccups with that);
everything looks mostly good, except for failures in `:beam-test-tools` and
`:io:aws`. Now I will start cherry-picking other fixes and trying to figure
out the specific issues.

Regards,
Anton

On Fri, Jun 21, 2019 at 3:17 PM Anton Kedin  wrote:

> Not much progress today. Debugging build issues when running global
> `./gradlew build -PisRelease --scan`
>
> Regards,
> Anton
>
> On Thu, Jun 20, 2019 at 4:12 PM Anton Kedin  wrote:
>
>> Published the snapshots, working through the verify_release_validation
>> script
>>
>> Got another blocker to be cherry-picked when merged:
>> https://issues.apache.org/jira/browse/BEAM-7603
>>
>> Regards,
>> Anton
>>
>>
>> On Wed, Jun 19, 2019 at 4:17 PM Anton Kedin  wrote:
>>
>>> I have cut the release branch for 2.14.0 and working through the release
>>> process. Next step is building the snapshot and release branch verification.
>>>
>>> There are two issues [1] that are still not resolved that are marked as
>>> blockers at the moment:
>>>  * [2] BEAM-7478 - remote cluster submission from Flink Runner broken;
>>>  * [3] BEAM-7424 - retries for GCS;
>>>
>>> [1]
>>> https://issues.apache.org/jira/browse/BEAM-7478?jql=project%20%3D%20BEAM%20%20AND%20fixVersion%20%3D%202.14.0%20AND%20status%20!%3D%20Closed%20AND%20status%20!%3DResolved
>>> [2] https://issues.apache.org/jira/browse/BEAM-7478
>>> [3] https://issues.apache.org/jira/browse/BEAM-7424
>>>
>>> Regards,
>>> Anton
>>>
>>


Re: Change of Behavior - JDBC Set Command

2019-06-27 Thread Anton Kedin
I think we thought about this approach but decided to get rid of the map
representation wherever we can while still supporting setting of the
options by name.

One of the less important downsides of keeping the map around is that we
will need to do `fromArgs` at least twice.

Another downside is that we will probably have to keep and maintain two
representations of the pipeline options at the same time and have extra
validations and probably reconciliation logic.

We need the map representation in the JDBC/command-line use case where it's
the only way for a user to specify the options. A user runs a special SQL
command which goes through normal parsing and execution logic. On top of
that we have a case of mixed Java/SQL pipelines, where we already have an
instance of PipelineOptions and don't need a user to set the options from
within a query. Right now this is impossible for other reasons as well. But
to support both JDBC and Java+SQL use cases we currently pass both a map
and a PipelineOptions instance around, which makes things confusing. We can
probably reduce passing things around but I think we will still need to
keep both representations.

Ideally, I think, mixed Java+SQL pipelines should be backed by that same
JDBC logic as much as possible. So potentially we should allow users to set
the pipeline options from within a complicated query even in SqlTransform
in a Java pipeline. However, setting an option from within SQL persists it
in the map, while in the mixed case we already have the PipelineOptions
instance that we got from the SqlTransform. So now we would need to maintain
logic to reconcile the two representations. That would probably involve
either something similar to the proposed reflection approach, or
serializing both representations to a map or JSON and then reconciling and
reconstructing from there. This sounds unnecessary and we can avoid
this if we are able to just set the pipeline options by name in the first
place. In that case we can just use whatever PipelineOptions instance we
have at the moment without extra validation / reconciliation.
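
For illustration (an editor's sketch, not the actual implementation in the
linked PR; the class and method names below are made up): "setting an option
by name" on an existing PipelineOptions instance boils down to finding the
corresponding JavaBean setter and invoking it, failing fast if the option
does not exist. Real code would also need type conversion and validation for
non-String options.

```
import java.lang.reflect.Method;
import org.apache.beam.sdk.options.PipelineOptions;

/** Hypothetical sketch only; not the implementation from the linked PR. */
class SetOptionByName {

  /** Example usage: setByName(options, "jobName", "my-sql-job"). */
  static void setByName(PipelineOptions options, String name, String value) throws Exception {
    // PipelineOptions properties follow the JavaBean convention: "jobName" -> "setJobName".
    String setterName = "set" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
    for (Method method : options.getClass().getMethods()) {
      if (method.getName().equals(setterName)
          && method.getParameterCount() == 1
          && method.getParameterTypes()[0] == String.class) {
        method.invoke(options, value);
        return;
      }
    }
    // Failing fast here is what surfaces a bad SET to the user immediately,
    // instead of at query submission time.
    throw new IllegalArgumentException("Unknown or non-String pipeline option: " + name);
  }
}
```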

Hope this makes sense.

Regards,
Anton

On Thu, Jun 27, 2019 at 4:38 PM Lukasz Cwik  wrote:

> Not sure, based upon the JIRA description it seems like you want early
> validation of PipelineOptions. Couldn't you maintain the map of pipeline
> options and every time one is added call PipelineOptionsFactory.fromArgs
> discarding the result just for the error checking?
>
> On Tue, Jun 25, 2019 at 10:12 AM Alireza Samadian 
> wrote:
>
>> Not sure. One solution might be moving the
>> PipelineOptionsReflectionSetter class to the SQL package and making it
>> package-private. This would prevent the exposure, but the downside is that I
>> would need to make PipelineOptionsFactory.parseObjects() public or duplicate
>> its code. Do you think this approach might be better? I would also
>> appreciate any other suggestions for solving this.
>>
>> Best,
>> Alireza
>>
>> On Tue, Jun 25, 2019 at 8:40 AM Lukasz Cwik  wrote:
>>
>>> That makes sense. I took a look at your PR, is there a way to do it
>>> without exposing the reflection capabilities to pipeline authors?
>>>
>>> On Mon, Jun 24, 2019 at 2:20 PM Alireza Samadian 
>>> wrote:
>>>
 Hi all,

 I am writing to ask if it is OK to slightly change the behaviour of the SET
 command in the JDBC connection of Beam SQL. Currently, if we use the SET
 command on an option that does not exist, or set an option to an illegal
 value, no error is shown until we run a query. This means one can potentially
 set it incorrectly, then reset it correctly and run a query without ever
 getting an error. However, I want to make some changes in the JDBC driver
 that cause this behavior to change. After this change, if someone uses the
 SET command with a wrong pipeline option (in the JDBC path), they will
 immediately see an error message.

 The reason for this change is that I am working on the Jira issue
 https://issues.apache.org/jira/projects/BEAM/issues/BEAM-7590, and I
 am removing the pipeline options Map representation and keeping the actual
 pipeline options instead. As a result, each time the SET command is
 called, it will try to change the pipeline options instance using
 reflection, instead of changing a map and later constructing pipeline
 options from it.

 The following is a link to the pull request:
 https://github.com/apache/beam/pull/8928

 Best,
 Alireza Samadian

>>>


Spotless exclusions

2019-06-26 Thread Anton Kedin
Currently our Spotless is configured globally [1] (for Java at least) to
include all source files via '**/*.java', and then we exclude things
explicitly. I don't know why, but these exclusions are sometimes ignored for
me; for example, `./gradlew :sdks:java:core:spotlessJavaCheck` always
fails when checking the generated files under
`.../build/generated-src/antlr/main/org/apache/beam/sdk/schemas/parser/generated`.

Few questions:
 * can someone point me to a discussion or a jira about this behavior?
 * do we actually have a use case for checking source files that are not
under 'src'?
 * if not, can we switch the config to only check for sources under 'src'
[2]?
 * alternatively, would it make sense to introduce project-specific
overrides?

[1]
https://github.com/apache/beam/blob/af9362168606df9ec11319fe706b72466413798c/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L819
[2] https://github.com/apache/beam/pull/8954


Golang dependencies in .test-infra/tools

2019-06-25 Thread Anton Kedin
Hi,

I am trying to verify the release and seeing failures when running
`./gradlew :beam-test-tools:build` (it is run as part of the global build).
The problem seems to be that it fails to cache one of the dependencies:

```
.gogradle/project_gopath/src/
github.com/apache/beam/.test-infra/tools/vendor/go.opencensus.io/trace/lrumap.go:18:2:
cannot find package "github.com/hashicorp/golang-lru/simplelru" in any of:
...
```

It is able to find `lrumap` and `simplelru` during the dependency
resolution step, and I can see them mentioned in a couple of artifacts
produced by the `gogradle` plugin. But when it runs `:installDependencies` to
actually copy them to the `vendor` directory, this specific package is
missing. This reproduces for me on a couple of different machines, on both
the release and master branches. I can't seem to find a relevant recent
change or pinpoint why exactly this happens in the plugin.

Does anyone have a clue what can be causing this and how to fix it?

Regards,
Anton


Re: [2.14.0] Release Progress Update

2019-06-21 Thread Anton Kedin
Not much progress today. Debugging build issues when running global
`./gradlew build -PisRelease --scan`

Regards,
Anton

On Thu, Jun 20, 2019 at 4:12 PM Anton Kedin  wrote:

> Published the snapshots, working through the verify_release_validation
> script
>
> Got another blocker to be cherry-picked when merged:
> https://issues.apache.org/jira/browse/BEAM-7603
>
> Regards,
> Anton
>
>
> On Wed, Jun 19, 2019 at 4:17 PM Anton Kedin  wrote:
>
>> I have cut the release branch for 2.14.0 and working through the release
>> process. Next step is building the snapshot and release branch verification.
>>
>> There are two issues [1] that are still not resolved that are marked as
>> blockers at the moment:
>>  * [2] BEAM-7478 - remote cluster submission from Flink Runner broken;
>>  * [3] BEAM-7424 - retries for GCS;
>>
>> [1]
>> https://issues.apache.org/jira/browse/BEAM-7478?jql=project%20%3D%20BEAM%20%20AND%20fixVersion%20%3D%202.14.0%20AND%20status%20!%3D%20Closed%20AND%20status%20!%3DResolved
>> [2] https://issues.apache.org/jira/browse/BEAM-7478
>> [3] https://issues.apache.org/jira/browse/BEAM-7424
>>
>> Regards,
>> Anton
>>
>


Re: [ANNOUNCE] New committer: Mikhail Gryzykhin

2019-06-21 Thread Anton Kedin
Congrats!

On Fri, Jun 21, 2019 at 3:55 AM Reza Rokni  wrote:

> Congratulations!
>
> On Fri, 21 Jun 2019, 12:37 Robert Burke,  wrote:
>
>> Congrats
>>
>> On Fri, Jun 21, 2019, 12:29 PM Thomas Weise  wrote:
>>
>>> Hi,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming a new
>>> committer: Mikhail Gryzykhin.
>>>
>>> Mikhail has been contributing to Beam and actively involved in the
>>> community for over a year. He developed the community build dashboard [1]
>>> and added substantial improvements to our build infrastructure. Mikhail's
>>> work also covers metrics, contributor documentation, development process
>>> improvements and other areas.
>>>
>>> In consideration of Mikhail's contributions, the Beam PMC trusts him
>>> with the responsibilities of a Beam committer [2].
>>>
>>> Thank you, Mikhail, for your contributions and looking forward to many
>>> more!
>>>
>>> Thomas, on behalf of the Apache Beam PMC
>>>
>>> [1] https://s.apache.org/beam-community-metrics
>>> [2]
>>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>>
>>>


Re: [2.14.0] Release Progress Update

2019-06-20 Thread Anton Kedin
Published the snapshots, working through the verify_release_validation
script

Got another blocker to be cherry-picked when merged:
https://issues.apache.org/jira/browse/BEAM-7603

Regards,
Anton


On Wed, Jun 19, 2019 at 4:17 PM Anton Kedin  wrote:

> I have cut the release branch for 2.14.0 and working through the release
> process. Next step is building the snapshot and release branch verification.
>
> There are two issues [1] that are still not resolved that are marked as
> blockers at the moment:
>  * [2] BEAM-7478 - remote cluster submission from Flink Runner broken;
>  * [3] BEAM-7424 - retries for GCS;
>
> [1]
> https://issues.apache.org/jira/browse/BEAM-7478?jql=project%20%3D%20BEAM%20%20AND%20fixVersion%20%3D%202.14.0%20AND%20status%20!%3D%20Closed%20AND%20status%20!%3DResolved
> [2] https://issues.apache.org/jira/browse/BEAM-7478
> [3] https://issues.apache.org/jira/browse/BEAM-7424
>
> Regards,
> Anton
>


[2.14.0] Release Progress Update

2019-06-19 Thread Anton Kedin
I have cut the release branch for 2.14.0 and am working through the release
process. The next step is building the snapshot and verifying the release branch.

There are two issues [1] that are still not resolved that are marked as
blockers at the moment:
 * [2] BEAM-7478 - remote cluster submission from Flink Runner broken;
 * [3] BEAM-7424 - retries for GCS;

[1]
https://issues.apache.org/jira/browse/BEAM-7478?jql=project%20%3D%20BEAM%20%20AND%20fixVersion%20%3D%202.14.0%20AND%20status%20!%3D%20Closed%20AND%20status%20!%3DResolved
[2] https://issues.apache.org/jira/browse/BEAM-7478
[3] https://issues.apache.org/jira/browse/BEAM-7424

Regards,
Anton


Re: [Final Reminder] Beam 2.14 release branch will be cut tomorrow at 6pm UTC

2019-06-19 Thread Anton Kedin
Makes sense. I will do the same then - will cut the release branch and wait
to cherry-pick the fixes.

Regards,
Anton

On Tue, Jun 18, 2019 at 10:25 PM Ismaël Mejía  wrote:

> Cutting the next release branch is not equal to starting the release
> vote. In the past we have cut the branch even if there are still open
> issues and then give people some days to trim their issues.
>
> So the release manager should create the release branch on the
> specified date and sync with the people working on the open issues so
> they cherry-pick their PRs into the release branch if needed or move
> them to the next release, and start the vote ONLY when the open issue
> list [1] count gets down to 0.
>
> Note: We can propose a different alternative but this has been
> effective in the past and gives contributors time to fix things to
> solve critical/blocker issues or issues that somehow need to be
> synced/discussed. Creating new RCs takes long and is not yet 100%
> automated (so it is error-prone); votes also take long, so the fewer
> RCs/votes we have to do the better:
>
> [1]
> https://issues.apache.org/jira/browse/BEAM-7478?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20fixVersion%20%3D%202.14.0
>
>
> On Wed, Jun 19, 2019 at 3:19 AM Chamikara Jayalath 
> wrote:
> >
> >
> >
> > On Tue, Jun 18, 2019 at 6:00 PM Anton Kedin  wrote:
> >>
> >> What is the right thing to do if it is not ready by the proposed branch
> cut time tomorrow? I don't think the Jira issue provides enough context
> about the severity of the problem and why it has to go out specifically in
> 2.14.0. Without additional context I think the expected path forward should
> look like this:
> >> * if it's a regression or something that really needs to block the
> release then I think more information about the problem is needed;
> >
> >
> > Context is that GCS may start throttling some of the requests and
> raising 429 errors so Beam should implement logic for retrying such
> failures with exponential backoff. Java SDK is already handling such
> failures correctly. +Heejong Lee is actively working on a fix for Python
> SDK. I believe this will be a relatively small change and a PR should be
> available within a day or so. We can also try to cherry-pick the fix to
> release branch after it is cut if you want to go ahead with the scheduled
> branch cut time.
> >
> > Thanks,
> > Cham
> >
> >>
> >> * if it's not a regression, proceed with the release even without the
> fix;
> >> * if the fix is ready before the release is completed, consider
> cherry-picking and re-doing the appropriate steps of the release process;
> >> * if the fix is not ready, consider doing a follow-up 2.14.1 release;
> >> * otherwise delay until 2.15.0;
> >>
> >> Regards,
> >> Anton
> >>
> >>
> >> On Tue, Jun 18, 2019 at 4:37 PM Chamikara Jayalath <
> chamik...@google.com> wrote:
> >>>
> >>> Please note that https://issues.apache.org/jira/browse/BEAM-7424 was
> marked as a blocker and we'd like to get the fix to Python SDK into the
> 2.14 release.
> >>>
> >>> Thanks,
> >>> Cham
> >>>
> >>> On Tue, Jun 18, 2019 at 4:16 PM Anton Kedin  wrote:
> >>>>
> >>>> It's a reminder, I am planning to cut the release branch tomorrow, on
> Wednesday, June 19, at 11am PDT (Seattle local time, corresponds to
> [19:00@GMT+1] and [18:00@UTC]). Please make sure all the code you want in
> the release is submitted by that time, and that all blocking Jiras have the
> release version attached.
> >>>>
> >>>> Thank you,
> >>>> Anton
> >>>>
> >>>> [1]
> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
> >>>> [2]
> https://issues.apache.org/jira/browse/BEAM-7478?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20fixVersion%20%3D%202.14.0
>


Re: [Final Reminder] Beam 2.14 release branch will be cut tomorrow at 6pm UTC

2019-06-18 Thread Anton Kedin
What is the right thing to do if it is not ready by the proposed branch cut
time tomorrow? I don't think the Jira issue provides enough context about
the severity of the problem and why it has to go out specifically in
2.14.0. Without additional context I think the expected path forward should
look like this:
* if it's a regression or something that really needs to block the release
then I think more information about the problem is needed;
* if it's not a regression, proceed with the release even without the fix;
* if the fix is ready before the release is completed, consider
cherry-picking and re-doing the appropriate steps of the release process;
* if the fix is not ready, consider doing a follow-up 2.14.1 release;
* otherwise delay until 2.15.0;

Regards,
Anton


On Tue, Jun 18, 2019 at 4:37 PM Chamikara Jayalath 
wrote:

> Please note that https://issues.apache.org/jira/browse/BEAM-7424 was
> marked as a blocker and we'd like to get the fix to Python SDK into the
> 2.14 release.
>
> Thanks,
> Cham
>
> On Tue, Jun 18, 2019 at 4:16 PM Anton Kedin  wrote:
>
>> It's a reminder, I am planning to cut the release branch tomorrow, on
>> Wednesday, June 19, at 11am PDT (Seattle local time, corresponds to
>> [19:00@GMT+1] and [18:00@UTC]). Please make sure all the code you want
>> in the release is submitted by that time, and that all blocking Jiras have
>> the release version attached.
>>
>> Thank you,
>> Anton
>>
>> [1]
>> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
>> [2]
>> https://issues.apache.org/jira/browse/BEAM-7478?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20fixVersion%20%3D%202.14.0
>>
>


[Final Reminder] Beam 2.14 release branch will be cut tomorrow at 6pm UTC

2019-06-18 Thread Anton Kedin
This is a reminder that I am planning to cut the release branch tomorrow, on
Wednesday, June 19, at 11am PDT (Seattle local time, corresponding to
[19:00@GMT+1] and [18:00@UTC]). Please make sure all the code you want in
the release is submitted by that time, and that all blocking Jiras have the
release version attached.

Thank you,
Anton

[1]
https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
[2]
https://issues.apache.org/jira/browse/BEAM-7478?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20fixVersion%20%3D%202.14.0


Re: GitHub checks not running

2019-06-17 Thread Anton Kedin
They are getting triggered now.

On Mon, Jun 17, 2019 at 9:10 AM Anton Kedin  wrote:

> Hi dev@,
>
> Does anyone have context on why the checks might not get triggered on pull
> requests today? E.g. https://github.com/apache/beam/pull/8822
>
> Regards,
> Anton
>


GitHub checks not running

2019-06-17 Thread Anton Kedin
Hi dev@,

Does anyone have context on why the checks might not get triggered on pull
requests today? E.g. https://github.com/apache/beam/pull/8822

Regards,
Anton


[Reminder] Beam 2.14 Release to be cut on Wed, June 19 at 6pm UTC

2019-06-17 Thread Anton Kedin
This is a reminder that I am planning to cut the release branch on Wednesday, June
19, at 11am PDT (Seattle local time, corresponding to [19:00@GMT+1] and
[18:00@UTC]). Please make sure all the code you want in the release is
submitted by that time, and that all blocking Jiras have the release
version attached.

Thank you,
Anton

[1]
https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
[2]
https://issues.apache.org/jira/browse/BEAM-7478?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20fixVersion%20%3D%202.14.0


[SQL] Let's split the TableProvider

2019-06-14 Thread Anton Kedin
Hi dev@, and especially anyone interested in SQL,

We have an interface called TableProvider (and some other related classes)
in Beam SQL that manages how we resolve the table schemas, construct IOs
and do other related and unrelated things when parsing the queries. At the
moment it feels very overloaded and not easy to use or extend. I propose we
split it into a few more abstractions. Here's an initial draft doc for
discussion, let me know what you think:

https://docs.google.com/document/d/1QAPz74XMctCsiUnutWR1ejEXjKGmpSQJ61qtfAiBE4E

Regards,
Anton


[Reminder] Beam 2.14.0 Release Soon

2019-06-12 Thread Anton Kedin
Reminder, the plan is to cut the branch a week from now, on June 19th.
Please mark all release blocking issues with fix version 2.14.

Thank you,
Anton

[1]
https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
[2]
https://issues.apache.org/jira/browse/BEAM-7478?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20fixVersion%20%3D%202.14.0


Re: [DISCUSS] Portability representation of schemas

2019-06-07 Thread Anton Kedin
The topic of schema registries probably does not block the design and
implementation of logical types and portable schemas by themselves; however,
I think we should spend some time discussing it (probably in a separate
thread) so that all SDKs have similar mechanisms for schema registration
and lookup.
The current Java SDK allows registering schemas for the Java types of the
elements, enabling automatic conversions from POJOs/AutoValues/etc. to Rows.
This approach is helpful within the Java SDK but it will need to be
generalized and extended. E.g. it should allow looking up schemas/types using
some other (customizable) logic, not just the Java type of the elements, or
maybe even dynamic schemas (not just Union; I don't know if there is a use
case for this). This should also include an understanding of how external
schema/metadata sources (Hive Metastore, Data Catalog) can be used in
different SDKs.
And maybe some general reflection mechanisms?
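
(Editor's illustration, not part of the original message: a minimal sketch of
the per-Java-type registration that exists in the Java SDK today and would
need to be generalized. The Transaction class and its fields are made up for
the example.)

```
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.schemas.JavaFieldSchema;
import org.apache.beam.sdk.schemas.annotations.DefaultSchema;
import org.apache.beam.sdk.schemas.transforms.Convert;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

public class SchemaRegistrationExample {

  // A schema is inferred from the public fields and registered for this Java type.
  @DefaultSchema(JavaFieldSchema.class)
  public static class Transaction {
    public String bank;
    public double amount;

    public Transaction() {}

    public Transaction(String bank, double amount) {
      this.bank = bank;
      this.amount = amount;
    }
  }

  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<Transaction> transactions =
        pipeline.apply(Create.of(new Transaction("bankA", 12.5), new Transaction("bankB", 7.0)));

    // Because a schema is registered for the element's Java type, the SDK can
    // convert the POJOs to Rows automatically.
    PCollection<Row> rows = transactions.apply(Convert.toRows());

    pipeline.run().waitUntilFinish();
  }
}
```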

Regards,
Anton


On Fri, Jun 7, 2019 at 4:35 AM Robert Burke  wrote:

> Wouldn't SDK specific types always be under the "coders" component instead
> of the logical type listing?
>
> Offhand, having a separate normalized listing of logical schema types in
> the pipeline components message of the types seems about right. Then
> they're unambiguous, but can also either refer to other logical types or
> existing coders as needed. When SDKs don't understand a given coder, the
> field could be just represented by a blob of bytes.
>
>
>
> On Wed, Jun 5, 2019, 11:29 PM Brian Hulette  wrote:
>
>> If we want to have a Pipeline level registry, we could add it to
>> Components [1].
>>
>> message Components {
>>   ...
>>   map<string, LogicalType> logical_types;
>> }
>>
>> And in FieldType reference the logical types by id:
>> oneof field_type {
>>   AtomicType atomic_type;
>>   ArrayType array_type;
>>   ...
>>   string logical_type_id;  // was LogicalType logical_type;
>> }
>>
>> I'm not sure I like this idea though. The reason we started discussing a
>> "registry" was just to separate the SDK-specific bits from the
>> representation type, and this doesn't accomplish that, it just de-dupes
>> logical types used
>> across the pipeline.
>>
>> I think instead I'd rather just come back to the message we have now in
>> the doc, used directly in FieldType's oneof:
>>
>> message LogicalType {
>>   FieldType representation = 1;
>>   string logical_urn = 2;
>>   bytes logical_payload = 3;
>> }
>>
>> We can have a URN for SDK-specific types (user type aliases), like
>> "beam:logical:javasdk", and the logical_payload could itself be a protobuf
>> with attributes of 1) a serialized class and 2/3) to/from functions. For
>> truly portable types it would instead have a well-known URN and optionally
>> a logical_payload with some agreed-upon representation of parameters.
>>
>> It seems like maybe SdkFunctionSpec/Environment should be used for this
>> somehow, but I can't find a good example of this in the Runner API to use
>> as a model. For example, what we're trying to accomplish is basically the
>> same as Java custom coders vs. standard coders. But that is accomplished
>> with a magic "javasdk" URN, as I suggested here, not with Environment
>> [2,3]. There is a "TODO: standardize such things" where that URN is
>> defined, is it possible that Environment is that standard and just hasn't
>> been utilized for custom coders yet?
>>
>> Brian
>>
>> [1]
>> https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L54
>> [2]
>> https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L542
>> [3]
>> https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/CoderTranslation.java#L121
>>
>> On Tue, Jun 4, 2019 at 2:24 PM Brian Hulette  wrote:
>>
>>> Yeah that's what I meant. It does seem reasonable to scope any
>>> registry by pipeline and not by PCollection. Then it seems we would want
>>> the entire LogicalType (including the `FieldType representation` field) as
>>> the value type, and not just LogicalTypeConversion. Otherwise we're
>>> separating the representations from the conversions, and duplicating the
>>> representations. You did say a "registry of logical types", so maybe that
>>> is what you meant.
>>>
>>> Brian
>>>
>>> On Tue, Jun 4, 2019 at 1:21 PM Reuven Lax  wrote:
>>>


 On Tue, Jun 4, 2019 at 9:20 AM Brian Hulette 
 wrote:

>
>
> On Mon, Jun 3, 2019 at 10:04 PM Reuven Lax  wrote:
>
>>
>>
>> On Mon, Jun 3, 2019 at 12:27 PM Brian Hulette 
>> wrote:
>>
>>> > It has to go into the proto somewhere (since that's the only way
>>> the SDK can get it), but I'm not sure they should be considered integral
>>> parts of the type.
>>> Are you just advocating for an approach where any SDK-specific
>>> information is stored outside of the Schema message itself so that 
>>> Schema
>>> really does 

Re: [PROPOSAL] Preparing for Beam 2.14.0 release

2019-06-06 Thread Anton Kedin
I don't know, haven't followed the docker images release thread. Will take
a look and see if it's feasible or is a blocker for this release.

Regards,
Anton

On Thu, Jun 6, 2019 at 12:41 PM Ismaël Mejía  wrote:

> Are you also planning to release the docker images that were postponed in
> the previous release? If so, starting early to define that part of the
> process would probably be a good idea.
>
> On Thu, Jun 6, 2019, 7:06 PM Jean-Baptiste Onofré  wrote:
>
>> +1
>>
>> Regards
>> JB
>> On Jun 6, 2019, at 19:02, Ankur Goenka  wrote:
>>>
>>> +1
>>>
>>> On Thu, Jun 6, 2019, 9:13 AM Ahmet Altay  wrote:
>>>
>>>> +1, thank you for keeping the cadence.
>>>>
>>>> On Thu, Jun 6, 2019 at 9:04 AM Anton Kedin  wrote:
>>>>
>>>>> Hello Beam community!
>>>>>
>>>>> Beam 2.14 release branch cut date is June 19 according to the release
>>>>> calendar [1]. I would like to volunteer myself to do this release. The 
>>>>> plan
>>>>> is to cut the branch on that date, and cherrypick fixes if needed.
>>>>>
>>>>> If you have release blocking issues for 2.14 please mark their "Fix
>>>>> Version" as 2.14.0 [2]. Please use 2.15.0 release in JIRA in case you
>>>>> would like to move any non-blocking issues to that version.
>>>>>
>>>>> And if we're doing a 2.7.1 release it should probably happen
>>>>> independently and in parallel if we want to maintain the release cadence.
>>>>>
>>>>> Thoughts, comments, objections?
>>>>>
>>>>> Thanks,
>>>>> Anton
>>>>>
>>>>> [1]
>>>>> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
>>>>> [2]
>>>>> https://issues.apache.org/jira/browse/BEAM-7478?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20fixVersion%20%3D%202.14.0
>>>>>
>>>>


[PROPOSAL] Preparing for Beam 2.14.0 release

2019-06-06 Thread Anton Kedin
Hello Beam community!

Beam 2.14 release branch cut date is June 19 according to the release
calendar [1]. I would like to volunteer myself to do this release. The plan
is to cut the branch on that date, and cherrypick fixes if needed.

If you have release blocking issues for 2.14 please mark their "Fix
Version" as 2.14.0 [2]. Please use 2.15.0 release in JIRA in case you would
like to move any non-blocking issues to that version.

And if we're doing a 2.7.1 release it should probably happen independently
and in parallel if we want to maintain the release cadence.

Thoughts, comments, objections?

Thanks,
Anton

[1]
https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
[2]
https://issues.apache.org/jira/browse/BEAM-7478?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20fixVersion%20%3D%202.14.0


Re: 1 Million Lines of Code (1 MLOC)

2019-05-31 Thread Anton Kedin
And to reduce the effort of future rewrites we should start doing it on a
schedule. I propose we start over once a week :)

On Fri, May 31, 2019 at 4:02 PM Lukasz Cwik  wrote:

> 1 million lines is too much, time to delete the entire project and start
> over again, :-)
>
> On Fri, May 31, 2019 at 3:12 PM Ankur Goenka  wrote:
>
>> Thanks for sharing.
>> These are really interesting metrics.
>> One use I can see is to track LOC vs Comments to make sure that we keep
>> up with the practice of writing maintainable code.
>>
>> On Fri, May 31, 2019 at 3:04 PM Ismaël Mejía  wrote:
>>
>>> I was checking some metrics in our codebase and found by chance that
>>> we have passed the 1 million lines of code (MLOC). Of course lines of
>>> code may not matter much but anyway it is interesting to see the size
>>> of our project at this moment.
>>>
>>> This is the detailed information returned by loc [1]:
>>>
>>>
>>> --------------------------------------------------------------------
>>>  Language      Files     Lines     Blank   Comment      Code
>>> --------------------------------------------------------------------
>>>  Java           3681    673007     78265    140753    453989
>>>  Python          497    131082     22560     13378     95144
>>>  Go              333    105775     13681     11073     81021
>>>  Markdown        205     31989      6526         0     25463
>>>  Plain Text       11     21979      6359         0     15620
>>>  Sass             92      9867      1434      1900      6533
>>>  JavaScript       19      5157      1197       467      3493
>>>  YAML             14      4601       454      1104      3043
>>>  Bourne Shell     30      3874       470      1028      2376
>>>  Protobuf         17      4258       677      1373      2208
>>>  XML              17      2789       296       559      1934
>>>  Kotlin           19      3501       347      1370      1784
>>>  HTML             60      2447       148       914      1385
>>>  Batch             3       249        57         0       192
>>>  INI               1       206        21        16       169
>>>  C++               2        72         4        36        32
>>>  Autoconf          1        21         1        16         4
>>> --------------------------------------------------------------------
>>>  Total          5002   1000874    132497    173987    694390
>>> --------------------------------------------------------------------
>>>
>>> [1] https://github.com/cgag/loc
>>>
>>


Re: SqlTransform Metadata

2019-05-14 Thread Anton Kedin
Reza, can you share more thoughts on how you think this can work
end-to-end?

Currently the approach is that populating the rows with the data happens
before the SqlTransform, and within the query you can only use the
things that are already in the rows or in the catalog/schema (or built-in
things). In general case populating the rows with any data can be solved
via a ParDo before SqlTransform. Do you think this approach lacks something
or maybe too verbose?
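
To make the current approach concrete, here is a minimal sketch of reifying the
element timestamp into a Row field with a ParDo before SqlTransform. The `words`
input, the field and schema names, and the query are made up for illustration, and
exact builder methods may differ slightly between Beam versions:

```
Schema schema =
    Schema.builder()
        .addStringField("word")
        .addDateTimeField("event_timestamp")
        .build();

PCollection<Row> rows =
    words
        .apply(
            "ReifyTimestamp",
            ParDo.of(
                new DoFn<String, Row>() {
                  @ProcessElement
                  public void processElement(ProcessContext c) {
                    // c.timestamp() is the element timestamp Beam already tracks.
                    c.output(
                        Row.withSchema(schema)
                            .addValues(c.element(), c.timestamp())
                            .build());
                  }
                }))
        .setRowSchema(schema);

PCollection<Row> result =
    rows.apply(SqlTransform.query("SELECT word, event_timestamp FROM PCOLLECTION"));
```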

My thoughts on this, lacking more info or concrete examples: in order to
access a timestamp value from within a query there has to be a syntax for
it. Field access expressions or function calls are the only things that
come to mind among existing syntax features that would allow that. Making
the timestamp a field of the data row makes more sense to me here because in
Beam it is already a part of the element. It's not the result of a function
call and it's already easily accessible, so it doesn't make sense to build
extra functions here. One of the problems with both approaches, however, is
potential conflicts with the existing schema of the data elements (or the
schema/catalog of the data source in general). E.g. if we add a magical
"event_timestamp" column or "event_timestamp()" function there may
potentially already exist a field or a function in the schema with this
name. This can be solved in a couple of ways, but we will probably want to
provide a configuration mechanism to assign different field/function
names in case of conflicts.

Given that, it may make sense to allow users to attach the whole pane info
or some subset of it to the row (e.g. only the timestamp), and make that
configurable. However, I am not sure whether exposing something like pane
info is enough and will cover a lot of useful cases. Plus, adding methods
like `attachTimestamp("fieldname")` or `attachWindowInfo("fieldname")`
might open a portal to an ever-increasing collection of these `attachX()`,
`attachY()` methods that can make the API less usable. If, on the other
hand, we made it more generic, then it would probably have to look a lot
like a ParDo or MapElements.via() anyway. And at that point the question
would be whether it makes sense to build something extra that probably
looks and functions like an existing feature.

Regards,
Anton



*From: *Andrew Pilloud 
*Date: *Tue, May 14, 2019 at 7:29 AM
*To: *dev

Hi Reza,
>
> Where will this metadata be coming from? Beam SQL is tightly coupled with
> the schema of the PCollection, so adding fields not in the data would be
> difficult.
>
> If what you want is the timestamp out of the DoFn.ProcessContext we might
> be able to add a SQL function to fetch that.
>
> Andrew
>
> *From: *Reza Rokni 
> *Date: *Tue, May 14, 2019, 1:08 AM
> *To: * 
>
> Hi,
>>
>> What are folks thoughts about adding something like
>> SqlTransform.withMetadata().query(...)to enable users to be able to
>> access things like Timestamp information from within the query without
>> having to refiy the information into the element itself?
>>
>> Cheers
>> Reza
>>
>>
>>
>> --
>>
>> This email may be confidential and privileged. If you received this
>> communication by mistake, please don't forward it to anyone else, please
>> erase all copies and attachments, and please let me know that it has gone
>> to the wrong person.
>>
>> The above terms reflect a potential business arrangement, are provided
>> solely as a basis for further discussion, and are not intended to be and do
>> not constitute a legally binding obligation. No legally binding obligations
>> will be created, implied, or inferred until an agreement in final form is
>> executed in writing by all parties involved.
>>
>


Re: Unexpected behavior of StateSpecs

2019-05-09 Thread Anton Kedin
Does it look similar to https://issues.apache.org/jira/browse/BEAM-6813 ? I
also stumbled on a problem with a state in DirectRunner but wasn't able to
figure it out yet:
https://lists.apache.org/thread.html/dae8b605a218532c085a0eea4e71338eae51922c26820f37b24875c0@%3Cdev.beam.apache.org%3E

Regards,
Anton

*From: *Jan Lukavský 
*Date: *Thu, May 9, 2019 at 8:13 AM
*To: * 

Because of the use of hashCode in StateSpecs, I'd say that it is. But it is
> not obvious. That's why I'd suggest to make it abstract on Coder, so that
> all implementations have to override it. That's a simple solution, but the
> question is - should hashCode of Coder be used that way? I think that
> StateSpec instances should be equal only to itself. Then the hashCode can
> be stored in the instance, e.g.
>
>   private final int hashCode = System.identityHashCode(this)
>
> and returned in hashCode(). There would be no need for Coder to implement
> hashCode anymore (if there aren't any other cases, where it is needed, in
> which case it would still be better to add abstract hashCode and equals
> methods on Coder).
>
> Jan
> On 5/9/19 5:04 PM, Reuven Lax wrote:
>
> Is a valid hashCode on Coder part of our contract or not? If it is, then
> the lack of hashCode on SchemaCoder is simply a bug.
>
> On Thu, May 9, 2019 at 7:42 AM Jan Lukavský  wrote:
>
>> Hi,
>>
>> I have spent several hour digging into strange issue with DirectRunner,
>> that manifested as non-deterministic run of pipeline. The pipeline
>> contains basically only single stateful ParDo, which adds elements into
>> state and after some timeout flushes these elements into output. The
>> issues was, that sometimes (very often) when the timer fired, the state
>> appeared to be empty, although I actually added something into the
>> state. I will skip details, but the problem boils down to the fact, that
>> StateSpecs hash Coder into hashCode - e.g.
>>
>>  @Override
>>  public int hashCode() {
>>return Objects.hash(getClass(), coder);
>>  }
>>
>> in ValueStateSpec. Now, when Coder doesn't have hashCode and equals
>> implemented (and there are some of those in the codebase itself - e.g.
>> SchemaCoder), it all blows up in a very hard-to-debug manner. So the
>> proposal is - either to add abstract hashCode and equals to Coder, or
>> don't hash the Coder into hashCode of StateSpecs (we can generate unique
>> ID for each StateSpec instance for example).
>>
>> Any thoughts about which path to follow? Or maybe both? :)
>>
>> Jan
>>
>>
>>
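
To make the pitfall concrete, here is a minimal sketch (names are illustrative, and
this is not the actual SchemaCoder fix) of a custom coder that defines structural
equals/hashCode, so that StateSpecs built around equal coder instances also compare
and hash equal:

```
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.CustomCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;

// Illustrative only: a coder whose base class does not provide structural
// equals/hashCode (the situation described above for SchemaCoder). Because
// StateSpecs hash the coder, overriding equals/hashCode structurally keeps two
// StateSpecs created from two equal coder instances in the same hash bucket.
public class NameCoder extends CustomCoder<String> {
  private static final Coder<String> INNER = StringUtf8Coder.of();

  @Override
  public void encode(String value, OutputStream outStream) throws IOException {
    INNER.encode(value, outStream);
  }

  @Override
  public String decode(InputStream inStream) throws IOException {
    return INNER.decode(inStream);
  }

  @Override
  public boolean equals(Object other) {
    // Structural equality: every instance of this coder encodes the same way.
    return other instanceof NameCoder;
  }

  @Override
  public int hashCode() {
    return NameCoder.class.hashCode();
  }
}
```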


Re: Pipeline options validation

2019-04-30 Thread Anton Kedin
Java8 Optional is not serializable. I think this may be a blocker. Or not?

Regards,
Anton

On Tue, Apr 30, 2019 at 12:18 PM Lukasz Cwik  wrote:

> The migration to requiring @Nullable on methods that could take/return
> null didn't update PipelineOptions contract and its validation to respect
> it.
> We could start using Optional but can't enforce requiring @Nullable since
> it is likely backwards incompatible and would break people's current usage
> of it.
>
> Also, @Nullable is still different from @Validation.Required
> if @Validation.Required was extended to represent whether something was
> explicitly set since in the non object usecase since users may be required
> to specify values for simple types like int instead of taking the default
> the language provides.
>
>
>
> On Tue, Apr 30, 2019 at 9:27 AM Ning Wang  wrote:
>
>> Interesting to know it needs to be an object. Thanks. I will try it.
>>
>> Agree with Kenneth though that Option might be more expected as an user.
>>
>> On Mon, Apr 29, 2019 at 7:16 PM Kenneth Knowles  wrote:
>>
>>> Does it make use of the @Nullable annotation or just assume any object
>>> reference could be null? Now that we are on Java 8 can it use Optional as
>>> well? (pet issue of mine :-)
>>>
>>> On Mon, Apr 29, 2019 at 5:29 PM Lukasz Cwik  wrote:
>>>
 The original ask for having the ability to introspect whether a field
 is set or not was in BEAM-2261 and it was to improve the logic around
 default values.

 I filed BEAM-7180 for making validation check if the field is set or
 not vs the current comparison which is null or not.

 On Mon, Apr 29, 2019 at 5:21 PM Lukasz Cwik  wrote:

> Kyle your right and it makes sense from the doc but from a user point
> of view the validation is really asking if the field has been set or not.
> Differentiation between unset and set has come up in the past for
> PipelineOptions.
>
> On Mon, Apr 29, 2019 at 5:19 PM Kyle Weaver 
> wrote:
>
>> Validation.Required: "This criteria specifies that the value must be
>> not null. Note that this annotation should only be applied to methods
>> that return nullable objects." [1]
>>
>> My guess is you should probably try the Integer class instead.
>>
>> [1]
>> https://github.com/apache/beam/blob/451af5133bc0a6416afa7b1844833c153f510181/sdks/java/core/src/main/java/org/apache/beam/sdk/options/Validation.java#L33-L34
>>
>> Kyle Weaver | Software Engineer | github.com/ibzib |
>> kcwea...@google.com | +1650203
>>
>> On Mon, Apr 29, 2019 at 5:12 PM Ning Wang 
>> wrote:
>> >
>> > Hi, Beam devs,
>> >
>> > I am working on a runner and found something not working as
>> expected.
>> >
>> > I have this field in my H*PipelineOptions,
>> > ```
>> >   @Description("Number of Containers")
>> >   @Validation.Required
>> >   int getNumberOfContainers();
>> >   void setNumberOfContainers(int value);
>> > ```
>> > and I am calling this validation function,
>> > ```
>> > H*PipelineOptions options =
>> > PipelineOptionsValidator.validate(H*PipelineOptions.class,
>> opts);
>> > ```
>> >
>> > I am expecting that if --numberOfContainer is missing in command
>> line, there should be an error, however it seems like the value is set 
>> to 0
>> by default.
>> >
>> > Is this the expected behavior? Or is there anything missing? My
>> Beam version is 2.11.0.
>> >
>> > Thanks in advance!
>> > --ning
>> >
>>
>
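
For reference, a sketch of the boxed-type variant suggested above; the interface and
option names mirror the snippet in the question and are otherwise illustrative:

```
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.PipelineOptionsValidator;
import org.apache.beam.sdk.options.Validation;

public class ContainerOptionsExample {

  // With the boxed Integer instead of the primitive int, an unset option stays null,
  // so @Validation.Required can actually fail when --numberOfContainers is missing.
  public interface MyRunnerPipelineOptions extends PipelineOptions {
    @Description("Number of Containers")
    @Validation.Required
    Integer getNumberOfContainers();

    void setNumberOfContainers(Integer value);
  }

  public static void main(String[] args) {
    MyRunnerPipelineOptions options =
        PipelineOptionsValidator.validate(
            MyRunnerPipelineOptions.class,
            PipelineOptionsFactory.fromArgs(args).as(MyRunnerPipelineOptions.class));
    System.out.println("numberOfContainers = " + options.getNumberOfContainers());
  }
}
```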


Re: Sharing plan to support complex equi-join condition in BeamSQL

2019-04-26 Thread Anton Kedin
Thank you for sharing this. This is a great overview. Left a few comments in
the doc.

Regards,
Anton

On Fri, Apr 26, 2019 at 10:12 AM Rui Wang  wrote:

> Hi Community,
>
> TL;DR:
>
> BeamSQL only supports equi-join, and its join condition can only be forms
> of `col_a = col_b` or `col_a = col_b AND ...`. I come up with a doc[1] to
> describe how to support complex equi-join condition in BeamSQL, along with
> JIRAs on each sub task.
>
>
> -Rui
>
> [1]
> https://docs.google.com/document/d/1vDiE4HR5ZdbZypIf1vzyFy9yKmAMWBu6BzBKfP7JhPc/edit?usp=sharing
>


Re: [PROPOSAL] Preparing for Beam 2.13.0 release

2019-04-26 Thread Anton Kedin
Following Ankur's link I see a "[+]GoogleCalendar" button in the bottom
right corner of the page. Clicking it opens the google calendar and prompts
to add the Beam Calendar (at least in Chrome). Ismael, do you have a
similar button in your case?

[image: image.png]

Regards,
Anton


On Fri, Apr 26, 2019 at 5:07 AM Ismaël Mejía  wrote:

> Ankur, do you have the equivalent link that I can use to subscribe to
> that calendar via google calendars?
> The link seems to work only to see the calendar in a webpage.
>
> Thanks.
>
> On Fri, Apr 26, 2019 at 1:42 PM Maximilian Michels  wrote:
> >
> > Hi Ankur,
> >
> > Sounds good. This will ensure that we stay on track regarding the
> > release cycle.
> >
> > Thanks,
> > Max
> >
> > On 26.04.19 02:59, Ankur Goenka wrote:
> > > Correction, The planned cut date is May 8th.
> > >
> > > On Thu, Apr 25, 2019 at 4:24 PM Ankur Goenka  > > > wrote:
> > >
> > > Hello Beam community!
> > >
> > > Beam 2.13 release branch cut date is April 8th according to the
> > > release calendar [1]. I would like to volunteer myself to do this
> > > release. I intend to cut the branch as planned on April 8th and
> > > cherrypick fixes if needed.
> > >
> > > If you have release blocking issues for 2.13 please mark their
> > > "Fix Version" as 2.13.0. Please use 2.14.0 release in JIRA in case
> > > you would like to move any non-blocking issues to that version.
> > >
> > > Does this sound reasonable?
> > >
> > > Thanks,
> > > Ankur
> > >
> > > [1]
> > >
> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com=America%2FLos_Angeles
> > >
>


Re: Removing Java Reference Runner code

2019-04-26 Thread Anton Kedin
If there are no plans to invest in the ULR, then it makes sense to remove it.

Going forward, however, I think we should try to document the higher level
approach we're taking with runners (and portability) now that we have
something working and can reflect on it. For example, a couple of things
that are not 100% clear to me:
 - if the focus is on the Python runner for portability efforts, how does
the Java SDK (and other languages) tie into this? E.g. how do we run, test,
measure, and develop things (pipelines, aspects of the SDK, the runner);
 - what's our approach to developing new features? Should we make sure the
Python runner supports them as early as possible (e.g. schemas and SQL)?
 - the Java DirectRunner is still there:
   - it is still the primary tool for Java SDK development purposes, and,
as Kenn mentioned in the linked threads, it adds value by making sure users
don't rely on implementation details of specific runners. Do we have a
similar story for portable scenarios?
   - I assume that the extra validations in the DirectRunner have an impact
on performance in various ways (potentially non-deterministic). While this
doesn't matter in some cases, it might in others. Having a local runner that
is (better) optimized for execution would probably make more sense for perf
measurements, integration tests, and maybe even local production jobs. Is
this something potentially worth looking into?

Regards,
Anton


On Fri, Apr 26, 2019 at 4:41 AM Maximilian Michels  wrote:

> Thanks for following up with this. I have mixed feelings to see the
> portable Java DirectRunner go, but I'm in favor of this change because
> it removes a lot of code that we do not really make use of.
>
> -Max
>
> On 26.04.19 02:58, Kenneth Knowles wrote:
> > Thanks for providing all this background on the PR. It is very easy to
> > see where it came from. Definitely nice to have less code and fewer
> > things that can break. Perhaps lazy consensus is enough.
> >
> > Kenn
> >
> > On Thu, Apr 25, 2019 at 4:01 PM Daniel Oliveira  > > wrote:
> >
> > Hey everyone,
> >
> > I made a preliminary PR for removing all the Java Reference Runner
> > code (PR-8380 ) since I
> > wanted to see if it could be done easily. It seems to be working
> > fine, so I wanted to open up this discussion to make sure people are
> > still in agreement on getting rid of this code and that people don't
> > have any concerns.
> >
> > For those who need additional context about this, this previous
> > thread
> > <
> https://lists.apache.org/thread.html/b235f8ee55a737ea399756edd80b1218ed34d3439f7b0ed59bfa8e40@%3Cdev.beam.apache.org%3E
> >
> > is where we discussed deprecating the Java Reference Runner (in some
> > places it's called the ULR or Universal Local Runner, but it's the
> > same thing). Then there's this thread
> > <
> https://lists.apache.org/thread.html/0b68efce9b7f2c5297b32d09e5d903e9b354199fe2ce446fbcd240bc@%3Cdev.beam.apache.org%3E
> >
> > where we discussed removing the code from the repo since it's been
> > deprecated.
> >
> > If no one has any objections to trying to remove the code I'll have
> > someone review the PR I wrote and start a vote to have it merged.
> >
> > Thanks,
> > Daniel Oliveira
> >
>


Re: Dependency management for multiple IOs

2019-02-19 Thread Anton Kedin
>>>>>
>>>>> For users of 1, they depend on Beam Java, Beam SQL, SQL Kafka Table,
>>>>> and pin a version of Kafka
>>>>> For users of 2, they depend on Beam Java, Beam SQL, KakfaIO, and pin a
>>>>> version of Kafka
>>>>>
>>>>> To be honest it is really hard to see which is preferable. I think
>>>>> number 1 has fewer funky dependency edges, more simple "compile + runtime"
>>>>> dependencies.
>>>>>
>>>>> Kenn
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Fri, Feb 15, 2019 at 6:06 PM Chamikara Jayalath <
>>>>> chamik...@google.com> wrote:
>>>>>
>>>>>> I think the underlying problem is two modules of Beam transitively
>>>>>> depending on conflicting dependencies (a.k.a. the diamond dependency
>>>>>> problem) ?
>>>>>>
>>>>>> I think the general solution for this is two fold. (at least the way
>>>>>> we have formulated in
>>>>>> https://beam.apache.org/contribute/dependencies/)
>>>>>>
>>>>>> (1) Keep Beam dependencies as much as possible hoping that transitive
>>>>>> dependencies stay compatible (we rely on semantic versioning here to not
>>>>>> cause problems for differences in minor/patch versions. Might not be the
>>>>>> case in practice for some dependencies).
>>>>>> (2) For modules with outdated dependencies that we cannot upgrade due
>>>>>> to some reason, we'll vendor those modules.
>>>>>>
>>>>>> Not sure if your specific problem need something more.
>>>>>>
>>>>>> Thanks,
>>>>>> Cham
>>>>>>
>>>>>> On Fri, Feb 15, 2019 at 4:48 PM Anton Kedin  wrote:
>>>>>>
>>>>>>> Hi dev@,
>>>>>>>
>>>>>>> I have a problem, I don't know a good way to approach the dependency
>>>>>>> management between Beam SQL and Beam IOs, and want to collect thoughts
>>>>>>> about it.
>>>>>>>
>>>>>>> Beam SQL depends on specific IOs so that users can query them. The
>>>>>>> IOs need their dependencies to work. Sometimes the IOs also leak their
>>>>>>> transitive dependencies (e.g. HCatRecord leaked from HCatalogIO). So if 
>>>>>>> in
>>>>>>> SQL we want to build abstractions on top of these IOs we risk having to
>>>>>>> bundle the whole IOs or the leaked dependencies. Overall we can probably
>>>>>>> avoid it by making the IOs `provided` dependencies, and by refactoring 
>>>>>>> the
>>>>>>> code that leaks. In this case things can be made to build, simple tests
>>>>>>> will run, and we won't need to bundle the IOs within SQL.
>>>>>>>
>>>>>>> But as soon as there's a need to actually work with multiple IOs at
>>>>>>> the same time the conflicts appear. For example, for testing of
>>>>>>> Hive/HCatalog IOs in SQL we need to create an embedded Hive Metastore
>>>>>>> instance. It is a very Hive-specific thing that requires its own
>>>>>>> dependencies that have to be loaded during testing as part of SQL 
>>>>>>> project.
>>>>>>> And some other IOs (e.g. KafkaIO) can bring similar but conflicting
>>>>>>> dependencies which means that we cannot easily work with or test both 
>>>>>>> IOs
>>>>>>> at the same time within SQL. I think it will become insane as number of 
>>>>>>> IOs
>>>>>>> supported in SQL grows.
>>>>>>>
>>>>>>> So the question is how to avoid conflicts between IOs within SQL?
>>>>>>>
>>>>>>> One approach is to create separate packages for each of the
>>>>>>> SQL-specific IO wrappers, e.g. 
>>>>>>> `beam-sdks-java-extensions-sql-hcatalog`, 
>>>>>>> `beam-sdks-java-extensions-sql-kafka`,
>>>>>>> etc. These projects will compile-depend on Beam SQL and on specific IO.
>>>>>>> Beam SQL will load these either from user-specified configuration or

Dependency management for multiple IOs

2019-02-15 Thread Anton Kedin
Hi dev@,

I have a problem, I don't know a good way to approach the dependency
management between Beam SQL and Beam IOs, and want to collect thoughts
about it.

Beam SQL depends on specific IOs so that users can query them. The IOs need
their dependencies to work. Sometimes the IOs also leak their transitive
dependencies (e.g. HCatRecord leaked from HCatalogIO). So if in SQL we want
to build abstractions on top of these IOs we risk having to bundle the
whole IOs or the leaked dependencies. Overall we can probably avoid it by
making the IOs `provided` dependencies, and by refactoring the code that
leaks. In this case things can be made to build, simple tests will run, and
we won't need to bundle the IOs within SQL.

But as soon as there's a need to actually work with multiple IOs at the
same time the conflicts appear. For example, for testing of Hive/HCatalog
IOs in SQL we need to create an embedded Hive Metastore instance. It is a
very Hive-specific thing that requires its own dependencies that have to be
loaded during testing as part of SQL project. And some other IOs (e.g.
KafkaIO) can bring similar but conflicting dependencies which means that we
cannot easily work with or test both IOs at the same time within SQL. I
think it will become insane as number of IOs supported in SQL grows.

So the question is how to avoid conflicts between IOs within SQL?

One approach is to create separate packages for each of the SQL-specific IO
wrappers, e.g. `beam-sdks-java-extensions-sql-hcatalog`,
`beam-sdks-java-extensions-sql-kafka`,
etc. These projects will compile-depend on Beam SQL and on specific IO.
Beam SQL will load these either from user-specified configuration or
something like @AutoService at runtime. This way Beam SQL doesn't know
about the details of the IOs and their dependencies, and they can be easily
tested in isolation without conflicting with each other. This should also
be relatively simple to manage if things change, the build logic should be
straightforward and easy to update. On the negative side, each of the
projects will require its own separate build logic, it will not be easy to
test multiple IOs together within SQL, and users will have to manage the
conflicting dependencies by themselves.
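
A rough sketch of what the runtime wiring could look like under this option. The
`SqlTableProviderRegistrar` SPI below is hypothetical (not an existing Beam API),
and each type would live in its own module and file; only the
@AutoService/ServiceLoader pattern is the point here:

```
import com.google.auto.service.AutoService;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Hypothetical SPI living in the core SQL module; name and shape are illustrative.
public interface SqlTableProviderRegistrar {
  /** The table type this module contributes, e.g. "kafka" or "hcatalog". */
  String getTableType();
}

// In a separate beam-sdks-java-extensions-sql-kafka module: register the
// implementation via @AutoService so the core SQL module can discover it at runtime
// without a compile-time dependency on KafkaIO.
@AutoService(SqlTableProviderRegistrar.class)
public class KafkaSqlRegistrar implements SqlTableProviderRegistrar {
  @Override
  public String getTableType() {
    return "kafka";
  }
}

// Back in the core SQL module: pick up whatever registrars are on the classpath.
public class SqlRegistrars {
  public static List<SqlTableProviderRegistrar> discover() {
    List<SqlTableProviderRegistrar> found = new ArrayList<>();
    ServiceLoader.load(SqlTableProviderRegistrar.class).forEach(found::add);
    return found;
  }
}
```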

Another approach is to keep things roughly as they are but create separate
configurations within the main `build.gradle` in SQL project, where
configurations will correspond to separate IOs or use cases (e.g. testing
of Hive-related IOs). The benefit is that everything related to SQL IOs
stays roughly in one place (including build logic) and can be built and
tested together when possible. Negative side is that it will probably
involve some groovy magic and classpath manipulation within Gradle tasks to
make the configurations work, plus it may be brittle if we change our
top-level Beam build logic. And this approach also doesn't make it easier
for the users to manage the conflicts.

Longer term we could probably also reduce the abstraction thickness on top
of the IOs, so that Beam SQL can work directly with IOs. For this to work
the supported IOs will need to expose things like `readRows()` and get/set
the schema on the PCollection. This is probably aligned with the Schema
work that's happening at the moment but I don't know whether it makes sense
to focus on this right now. The problem of the dependencies is not solved
here either, but I think it will be at least the same problem as users
already have if they see conflicts when using multiple IOs with Beam
pipelines.

Thoughts, ideas? Did anyone ever face a problem like this, or am I
completely misunderstanding something in the Beam build logic?

Regards,
Anton


[SQL] External schema providers

2019-02-14 Thread Anton Kedin
Hi dev@,

A quick update about a new Beam SQL feature.

In short, we have wired up support for plugging table providers through the
Beam SQL API to allow obtaining table schemas from external sources.

*What does it even mean?*

Previously, in Java pipelines, you could apply a Beam SQL query to existing
PCollections. We have a special SqlTransform to do that, it converts a SQL
query to an equivalent PTransform that is applied to the PCollection of Rows
.

One major inconvenience in this approach is that to query something, it has
to be a PCollection, i.e. you have to read the data from a specific source
and then convert it to rows. This can mean multiple complications, like
potentially converting schemas from the source to Beam manually, or having
completely different logic when changing the source.

The new API allows you to plug a schema provider that can resolve the
tables and schemas automatically if they already exist somewhere else. This
way Beam SQL, with the help of the provider, does the table lookup, then IO
configuration, and then schema conversion if needed.

As an example, here's a query
[1]
that joins 2 existing PCollections with a table from Hive using
HCatalogTableProvider. Hive table lookup is automatic, the table provider
in this case will resolve the tables by talking to Hive Metastore and will
read the data by configuring and applying the HCatalogIO, converting the
records to Rows under the hood.
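
For a flavor of the API, here is a sketch of querying a Hive table through a
plugged-in provider. The provider construction, the `withTableProvider` naming, and
the `hive.customers` table reference are written from memory and may differ from the
released API, so treat the linked test [1] as the authoritative example; the helper
below is hypothetical:

```
// Sketch only: query a Hive table through a plugged-in table provider, without
// manually configuring HCatalogIO or converting records to Rows.
TableProvider hiveTableProvider =
    buildHCatalogTableProvider("thrift://metastore-host:9083"); // hypothetical helper

PCollection<Row> customers =
    pipeline.apply(
        SqlTransform.query("SELECT customer_id, country FROM hive.customers")
            .withTableProvider("hive", hiveTableProvider));
```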

*What's the status of this?*

This is a working implementation, but the development is still ongoing:
there are bugs, the API might change, and there are a few more things I can
see coming related to this after further design discussions:

 * refactor of the underlying table/metadata provider code;
 * working out the design for supporting creating / updating the tables in
the metadata provider;
 * creating a DDL syntax for it;
 * creating more providers;

[1]
https://github.com/apache/beam/blob/116600f32013620e748723b8022a7023fa8e2528/sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/BeamSqlHiveSchemaTest.java#L175,L190


Re: Findbugs -> Spotbugs ?

2019-01-31 Thread Anton Kedin
It would be nice. How fast is it on Beam codebase?

Regards,
Anton

On Thu, Jan 31, 2019 at 10:38 AM Udi Meiri  wrote:

> +1 for spotbugs
>
> On Thu, Jan 31, 2019 at 5:09 AM Gleb Kanterov  wrote:
>
>> Agree, spotbugs brings static checks that aren't covered in error-prone,
>> it's a good addition. There are few conflicts between error-prone and
>> spotbugs, for instance, the approach to enum switch exhaustiveness, but it
>> can be configured.
>>
>> On Thu, Jan 31, 2019 at 10:53 AM Ismaël Mejía  wrote:
>>
>>> Not a blocker but there is not a spotbugs plugin for IntelliJ.
>>>
>>> On Thu, Jan 31, 2019 at 10:45 AM Ismaël Mejía  wrote:
>>> >
>>> > YES PLEASE let's move to spotbugs !
>>> > Findbugs has not had a new release in ages, and does not support Java
>>> > 11 either, so this will address another possible issue.
>>> >
>>> > On Thu, Jan 31, 2019 at 8:28 AM Kenneth Knowles 
>>> wrote:
>>> > >
>>> > > Over the last few hours I activated findbugs on the Dataflow Java
>>> worker and fixed or suppressed the errors. They started around 60 but
>>> fixing some uncovered others, etc. You can see the result at
>>> https://github.com/apache/beam/pull/7684.
>>> > >
>>> > > It has convinced me that findbugs still adds value, beyond
>>> errorprone and nullaway/checker/infer. Quite a few of the issues were not
>>> nullability related, though nullability remains the most obvious
>>> low-hanging fruit where a different tool would do even better than
>>> findbugs. I have not yet enable "non null by default" which exposes 100+
>>> new bugs in the worker, at minimum.
>>> > >
>>> > > Are there known blockers for upgrading to spotbugs so we are
>>> depending on an active project?
>>> > >
>>> > > Kenn
>>>
>>
>>
>> --
>> Cheers,
>> Gleb
>>
>


Re: [ANNOUNCE] New committer announcement: Gleb Kanterov

2019-01-25 Thread Anton Kedin
Congrats!

On Fri, Jan 25, 2019 at 8:54 AM Ismaël Mejía  wrote:

> Well deserved, congratulations Gleb!
>
> On Fri, Jan 25, 2019 at 10:47 AM Etienne Chauchot 
> wrote:
> >
> > Congrats Gleb and welcome onboard !
> >
> > Etienne
> >
> > Le vendredi 25 janvier 2019 à 10:39 +0100, Alexey Romanenko a écrit :
> >
> > Congrats to Gleb and welcome on board!
> >
> > On 25 Jan 2019, at 09:22, Tim Robertson 
> wrote:
> >
> > Welcome Gleb and congratulations!
> >
> > On Fri, Jan 25, 2019 at 8:06 AM Kenneth Knowles  wrote:
> >
> > Hi all,
> >
> > Please join me and the rest of the Beam PMC in welcoming a new
> committer: Gleb Kanterov
> >
> > Gleb started contributing to Beam and quickly dove deep, doing some
> sensitive fixes to schemas, also general build issues, Beam SQL, Avro, and
> more. In consideration of Gleb's technical and community contributions, the
> Beam PMC trusts Gleb with the responsibilities of a Beam committer [1].
> >
> > Thank you, Gleb, for your contributions.
> >
> > Kenn
> >
> > [1]
> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
> >
> >
>


Re: compileJava broken on master see: BEAM-6495

2019-01-23 Thread Anton Kedin
We don't pre-generate the code as a separate step. Code generation from the
SQL parser syntax spec and its compilation both happen during the Beam SQL
build task. Splitting the code generation and compilation might not be
trivial. We definitely should look into fixing this though.

Regards,
Anton

On Wed, Jan 23, 2019 at 11:13 AM Alex Amato  wrote:

> Okay, makes sense. Perhaps we can somehow make it fail when it fails to
> generate the dep, rather than when compiling the Java code later on.
>
> On Wed, Jan 23, 2019 at 11:12 AM Anton Kedin  wrote:
>
>> ParserImpl is autogenerated by Calcite at build time. It seems that
>> there's a race condition there and it sometimes fails. Rerunning the build
>> works for me.
>>
>> Regards,
>> Anton
>>
>> On Wed, Jan 23, 2019, 11:06 AM Alex Amato  wrote:
>>
>>> https://jira.apache.org/jira/browse/BEAM-6495?filter=-2
>>>
>>> Any ideas, how this got through the precommit?
>>>
>>> > Task :beam-sdks-java-extensions-sql:compileJava FAILED
>>>
>>> /usr/local/google/home/ajamato/go/src/
>>> github.com/apache/beam/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/JdbcFactory.java:29:
>>> error: cannot find symbol
>>>
>>> import
>>> org.apache.beam.sdk.extensions.sql.impl.parser.impl.BeamSqlParserImpl;
>>>
>>>   ^
>>>
>>>   symbol:   class BeamSqlParserImpl
>>>
>>>   location: package org.apache.beam.sdk.extensions.sql.impl.parser.impl
>>>
>>> 1 error
>>>
>>>


Re: compileJava broken on master see: BEAM-6495

2019-01-23 Thread Anton Kedin
ParserImpl is autogenerated by Calcite at build time. It seems that there's
a race condition there and it sometimes fails. Rerunning the build works
for me.

Regards,
Anton

On Wed, Jan 23, 2019, 11:06 AM Alex Amato  wrote:

> https://jira.apache.org/jira/browse/BEAM-6495?filter=-2
>
> Any ideas, how this got through the precommit?
>
> > Task :beam-sdks-java-extensions-sql:compileJava FAILED
>
> /usr/local/google/home/ajamato/go/src/
> github.com/apache/beam/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/JdbcFactory.java:29:
> error: cannot find symbol
>
> import
> org.apache.beam.sdk.extensions.sql.impl.parser.impl.BeamSqlParserImpl;
>
>   ^
>
>   symbol:   class BeamSqlParserImpl
>
>   location: package org.apache.beam.sdk.extensions.sql.impl.parser.impl
>
> 1 error
>
>


Re: Why does Beam not use the google-api-client libraries?

2019-01-02 Thread Anton Kedin
I don't have enough context to answer all of the questions, but looking at
PubsubIO it seems to use the official libraries, e.g. see Pubsub doc [1]
vs Pubsub IO GRPC client [2]. Correct me if I misunderstood your question.

[1]
https://cloud.google.com/pubsub/docs/publisher#pubsub-publish-message-java
[2]
https://github.com/apache/beam/blob/2e759fecf63d62d110f29265f9438128e3bdc8ab/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubGrpcClient.java#L189

Pubsub IO JSON client seems to use a slightly different approach but still
relies on somewhat official path, e.g. Pubsub doc [3] (javadoc[4]) vs
Pubsub IO JSON client [5].

[3] https://developers.google.com/api-client-library/java/apis/pubsub/v1
[4]
https://developers.google.com/resources/api-libraries/documentation/pubsub/v1/java/latest/com/google/api/services/pubsub/Pubsub.html
[5]
https://github.com/apache/beam/blob/2e759fecf63d62d110f29265f9438128e3bdc8ab/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubJsonClient.java#L130

The latter seems to be the older library, so I would assume it's for legacy
reasons.

Regards,
Anton


On Wed, Jan 2, 2019 at 9:03 AM Jeff Klukas  wrote:

> I'm building a high-volume Beam pipeline using PubsubIO and running into
> some concerns over performance and delivery semantics, prompting me to want
> to better understand the implementation. Reading through the library,
> PubsubIO appears to be a completely separate implementation of Pubsub
> client behavior from Google's own Java client. As a developer trying to
> read and understand the implementation, this is a significant hurdle, since
> any previous knowledge of the Google library is not applicable and is
> potentially at odds with what's in PubsubIO.
>
> Why doesn't beam use the Google clients for PubsubIO, BigQueryIO, etc.? Is
> it for historical reasons? Is there difficulty in packaging and integration
> of the Google clients? Or are the needs for Beam just substantially
> different from what the Google libraries provide?
>


Re: [RFC] I made a new tabbed Beam view in Jenkins

2018-12-18 Thread Anton Kedin
This is really helpful, didn't realize it was possible. Categories and
contents look reasonable. I think something like this definitely should be
the top-level Beam view.

Regards,
Anton

On Tue, Dec 18, 2018 at 12:05 PM Kenneth Knowles  wrote:

> Hi all,
>
> I made a new view to split Beam builds into tabs:
> https://builds.apache.org/view/A-D/view/Beam%20Nested/
>
>  - PostCommit tab includes PostCommit and "PreCommit_.*_Cron" because
> these are actually post-commit jobs; it is a feature not a bug.
>  - PreCommit tab includes jobs that have no meaningful history because
> they are just against PRs, commits, phrase triggering
>  - Inventory self-explanatory
>  - PerformanceTests self-explanatory
>  - All; I didn't want to keep making categories but just send this for
> feedback
>
> WDYT about making this the top-level Beam view? (vs
> https://builds.apache.org/view/A-D/view/Beam/)
>
> After that, maybe we could clean the categories so they fit into the tabs
> more easily with fewer regexes (to make sure things don't get missed). I
> have read also that if you use / instead of _ as a separator in a name then
> Jenkins will display jobs as nested in folders automatically. Not sure it
> actually results in a better view; haven't tried it.
>
> Kenn
>


Re: [DISCUSS] Structuring Java based DSLs

2018-11-30 Thread Anton Kedin
I think this approach makes sense in general; Euphoria can be an
implementation detail of SQL, similar to the Join Library or core SDK Schemas.

I wonder though whether it would be better to bring Euphoria closer to core
SDK first, maybe even merge them together. If you look at Reuven's recent
work around schemas it seems like there are already similarities between
that and Euphoria's approach, unless I'm missing the point (e.g. Filter
transforms, FullJoin vs CoGroup... see [2]). And we're already switching
parts of SQL to those transforms (e.g. SQL Aggregation is now implemented
by core SDK's Group[3]).

Adding explicit Schema support to Euphoria will bring it both closer to
core SDK and make it natural to use for SQL. Can this be a first step
towards this integration?

One question I have is, does Euphoria bring dependencies that are not
needed by SQL, or does it more or less rely only on the core SDK?

[1]
https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Group.java#L73
[2]
https://github.com/apache/beam/tree/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms
[3]
https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamAggregationRel.java#L179



On Fri, Nov 30, 2018 at 6:29 AM Jan Lukavský  wrote:

> Hi community,
>
> I'm part of Euphoria DSL team, and on behalf of this team, I'd like to
> discuss possible development of Java based DSLs currently present in
> Beam. In my knowledge, there are currently two DSLs based on Java SDK -
> Euphoria and SQL. These DSLs currently share only the SDK itself,
> although there might be room to share some more effort. We already know
> that both Euphoria and SQL have need for retractions, but there are
> probably many more features that these two could share.
>
> So, I'd like to open a discussion on what it would cost and what it
> would possibly bring, if instead of the current structure
>
>Java SDK
>
>  |  SQL
>
>  |  Euphoria
>
> these DSLs would be structured as
>
>Java SDK ---> Euphoria ---> SQL
>
> I'm absolutely sure that this would be a great investment and a huge
> change, but I'd like to gather some opinions and general feelings of the
> community about this. Some points to start the discussion from my side
> would be, that structuring DSLs like this has internal logical
> consistency, because each API layer further narrows completeness, but
> brings simpler API for simpler tasks, while adding additional high-level
> view of the data processing pipeline and thus enabling more
> optimizations. On Euphoria side, these are various implementations joins
> (most effective implementation depends on data), pipeline sampling and
> more. Some (or maybe most) of these optimizations would have to be
> implemented in both DSLs, so implementing them once is beneficial.
> Another benefit is that this would bring Euphoria "closer" to Beam core
> development (which would be good, it is part of the project anyway,
> right? :)) and help better drive features, that although currently
> needed mostly by SQL, might be needed by other Java users anyway.
>
> Thanks for discussion and looking forward to any opinions.
>
>Jan
>
>


Re: Design review for supporting AutoValue Coders and conversions to Row

2018-11-15 Thread Anton Kedin
One reason is that @AutoValue is not guaranteed to be retained at runtime:
https://github.com/google/auto/blob/master/value/src/main/java/com/google/auto/value/AutoValue.java#L44


On Thu, Nov 15, 2018 at 11:36 AM Kenneth Knowles  wrote:

> Just some low-level detail: If there is no @DefaultSchema annotation but
> it is an @AutoValue class, can schema inference go ahead with the
> AutoValueSchema? Then the user doesn't have to do anything.
>
> Kenn
>
> On Wed, Nov 14, 2018 at 6:14 AM Reuven Lax  wrote:
>
>> We already have a framework for ByteBuddy codegen for JavaBean Row
>> interfaces, which should hopefully be easy to extend AutoValue (and more
>> efficient than using reflection). I'm working on adding constructor support
>> to this right now.
>>
>> On Wed, Nov 14, 2018 at 12:29 AM Jeff Klukas  wrote:
>>
>>> Sounds, then, like we need to a define a new `AutoValueSchema extends
>>> SchemaProvider` and users would opt-in to this via the DefaultSchema
>>> annotation:
>>>
>>> @DefaultSchema(AutoValueSchema.class)
>>> @AutoValue
>>> public abstract MyClass ...
>>>
>>> Since we already have the JavaBean and JavaField reflection-based schema
>>> providers to use as a guide, it sounds like it may be best to try to
>>> implement this using reflection rather than implementing an AutoValue
>>> extension.
>>>
>>> A reflection-based approach here would hinge on being able to discover
>>> the package-private constructor for the concrete class and read its types.
>>> Those types would define the schema, and the fromRow impementation would
>>> call the discovered constructor.
>>>
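
A rough sketch of that reflection step, relying only on AutoValue's documented
convention of generating a concrete `AutoValue_<Name>` class in the same package.
The helper name and the lack of error handling are illustrative, and it assumes a
top-level @AutoValue class:

```
import java.lang.reflect.Constructor;

// Illustrative only, not the actual Beam code.
class AutoValueReflection {
  static Object instantiate(Class<?> abstractClass, Object... fieldValues)
      throws ReflectiveOperationException {
    String generatedName =
        abstractClass.getPackage().getName() + ".AutoValue_" + abstractClass.getSimpleName();
    Class<?> generated = Class.forName(generatedName);

    // The generated class has a single, non-public constructor taking all properties
    // in declaration order; its parameter types are what a schema provider would read
    // to build the schema, and fromRow would call it to construct instances.
    Constructor<?> constructor = generated.getDeclaredConstructors()[0];
    constructor.setAccessible(true);
    return constructor.newInstance(fieldValues);
  }
}
```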
>>> On Mon, Nov 12, 2018 at 10:02 AM Reuven Lax  wrote:
>>>


 On Mon, Nov 12, 2018 at 11:38 PM Jeff Klukas 
 wrote:

> Reuven - A SchemaProvider makes sense. It's not clear to me, though,
> whether that's more limited than a Coder. Do all values of the schema have
> to be simple types, or does Beam SQL support nested schemas?
>

 Nested schemas, collection types (lists and maps), and collections of
 nested types are all supported.

>
> Put another way, would a user be able to create an AutoValue class
> comprised of simple types and then use that as a field inside another
> AutoValue class? I can see how that's possible with Coders, but not clear
> whether that's possible with Row schemas.
>

 Yes, this is explicitly supported.

>
> On Fri, Nov 9, 2018 at 8:22 PM Reuven Lax  wrote:
>
>> Hi Jeff,
>>
>> I would suggest a slightly different approach. Instead of generating
>> a coder, writing a SchemaProvider that generates a schema for AutoValue.
>> Once a PCollection has a schema, a coder is not needed (as Beam knows how
>> to encode any type with a schema), and it will work seamlessly with Beam
>> SQL (in fact you don't need to write a transform to turn it into a Row 
>> if a
>> schema is registered).
>>
>> We already do this for POJOs and basic JavaBeans. I'm happy to help
>> do this for AutoValue.
>>
>> Reuven
>>
>> On Sat, Nov 10, 2018 at 5:50 AM Jeff Klukas 
>> wrote:
>>
>>> Hi all - I'm looking for some review and commentary on a proposed
>>> design for providing built-in Coders for AutoValue classes. There's
>>> existing discussion in BEAM-1891 [0] about using AvroCoder, but that's
>>> blocked on incompatibility between AutoValue and Avro's reflection
>>> machinery that doesn't look resolvable.
>>>
>>> I wrote up a design document [1] that instead proposes using
>>> AutoValue's extension API to automatically generate a Coder for each
>>> AutoValue class that users generate. A similar technique could be used 
>>> to
>>> generate conversions to and from Row for use with BeamSql.
>>>
>>> I'd appreciate review of the design and thoughts on whether this
>>> seems feasible to support within the Beam codebase. I may be missing a
>>> simpler approach.
>>>
>>> [0] https://issues.apache.org/jira/browse/BEAM-1891
>>> [1]
>>> https://docs.google.com/document/d/1ucoik4WzUDfilqIz3I1AuMHc1J8DE6iv7gaUCDI42BI/edit?usp=sharing
>>>
>>


Re: Design review for supporting AutoValue Coders and conversions to Row

2018-11-09 Thread Anton Kedin
Hi Jeff,

I think this is a great idea! Thank you for working on the proposal. I left
a couple of comments in the doc.

Have you tried prototyping this?

Regards,
Anton

On Fri, Nov 9, 2018 at 1:50 PM Jeff Klukas  wrote:

> Hi all - I'm looking for some review and commentary on a proposed design
> for providing built-in Coders for AutoValue classes. There's existing
> discussion in BEAM-1891 [0] about using AvroCoder, but that's blocked on
> incompatibility between AutoValue and Avro's reflection machinery that
> doesn't look resolvable.
>
> I wrote up a design document [1] that instead proposes using AutoValue's
> extension API to automatically generate a Coder for each AutoValue class
> that users generate. A similar technique could be used to generate
> conversions to and from Row for use with BeamSql.
>
> I'd appreciate review of the design and thoughts on whether this seems
> feasible to support within the Beam codebase. I may be missing a simpler
> approach.
>
> [0] https://issues.apache.org/jira/browse/BEAM-1891
> [1]
> https://docs.google.com/document/d/1ucoik4WzUDfilqIz3I1AuMHc1J8DE6iv7gaUCDI42BI/edit?usp=sharing
>


Stackoverflow Questions

2018-11-05 Thread Anton Kedin
Hi dev@,

I was looking at stackoverflow questions tagged with `apache-beam` [1] and
wanted to ask your opinion. It feels like it's easier for some users to ask
questions on stackoverflow than on user@. Overall frequency between the two
channels seems comparable but a lot of stackoverflow questions are not
answered while questions on user@ get some attention most of the time.
Would it make sense to increase dev@ visibility into stackoverflow, e.g. by
sending a periodic digest or some other way?

[1] https://stackoverflow.com/questions/tagged/apache-beam

Regards,
Anton


Re: Fixing equality of Rows

2018-10-29 Thread Anton Kedin
About these specific use cases, how useful is it to support Map
and List? These seem pretty exotic (maybe they aren't) and I wonder
whether it would make sense to just reject them until we have a solid
design.

And wouldn't the same problems arise even without RowCoder? Is the path in
that case to implement a custom coder?

Regards,
Anton


On Mon, Oct 29, 2018 at 9:05 AM Kenneth Knowles  wrote:

> I'll summarize my input to the discussion. It is rather high level. But
> IMO:
>
>  - even though schemas are part of Beam Java today, I think they should
> become part of portability when ready
>  - so each type in a schema needs a language-independent &
> encoding-independent notion of domain of values and equality (so obviously
> equal bytes are equal)
>  - embedding in any language (hence Row in Java) must have a schema
> type-driven equality that matches this spec
>  - also each type (hence Row type) should have portable encoding(s) that
> respect this equality so shuffling is consistent
>  - Row in Java should be able to decode these encodings to different
> underlying representations and change its strategy over time
>
> Kenn
>
> On Mon, Oct 29, 2018 at 8:08 AM Gleb Kanterov  wrote:
>
>> With adding BYTES type, we broke equality.
>> `RowCoder#consistentWithEquals` is always true, but this property doesn't
>> hold for exotic types such as `Map`, `List`. The root
>> cause is `byte[]`, where `equals` is implemented as reference equality
>> instead of structural.
>>
>> Before we jump into solution mode, let's state what we want to have:
>> - *API* have stable API and be able to evolve efficient and use-case
>> specific implementations without breaking it
>> - *Correctness *we can't trade off correctness, a trivial implementation
>> should work
>> - *Performance *comparing equality is a fundamental operation, and we
>> want to make it cheap
>>
>> 1. set `consistentWithEquals` if there is BYTES field
>> Pros: almost no pros
>> Cons: It would introduce a significant number of allocations when
>> comparing rows, so we reject this option.
>>
>> 2. implement custom deep equals in `Row#equals`
>> Pros: good performance, doesn't change API, `Row#equals` is correct
>> Cons: doesn't work for `Map`, unless we roll own implementation
>> Cons: it's possible to obtain `List` from `getValue()` that has
>> broken equality, contains, etc, unless we roll own implementation
>> Cons: non-trivial and requires ~200LOC to implement
>>
>> 3. wrapping byte[] into Java object with fixed equality (e.g.,
>> StructuralByteArray)
>> Pros: good performance and flexible to change how Java wrapper is
>> implemented
>> Pros: simple, doesn't require any specialized collections, no surprises,
>> `Map` and `List` work.
>> Cons: will change the return type of `Row#getValue`
>>
>> I want to suggest going with option #3. However, it isn't completely
>> clear what wrapper we want to use, either it could be StructuralByteArray,
>> or ByteBuffer. ByteBuffer is more standard. However, it comes with 4
>> additional integer fields. StructuralByteArray doesn't have anything not
>> necessary. One option would be adding `Row#getByteBuffer` that would be
>> `ByteBuffer.wrap(getValue(i).getValues())`, specialized implementation can
>> override it for better performance, but `getValue(i)` must return
>> StructuralByteArray.
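
For reference, a minimal sketch of the wrapper idea in option #3 (illustrative, not
necessarily identical to Beam's own StructuralByteArray): equality and hashCode are
defined over the array contents, so map keys and `List#contains` behave as expected:

```
import java.util.Arrays;

public final class StructuralByteArray {
  private final byte[] value;

  public StructuralByteArray(byte[] value) {
    this.value = value;
  }

  public byte[] getValue() {
    return value;
  }

  @Override
  public boolean equals(Object other) {
    // Structural (content-based) equality instead of byte[] reference equality.
    return other instanceof StructuralByteArray
        && Arrays.equals(value, ((StructuralByteArray) other).value);
  }

  @Override
  public int hashCode() {
    return Arrays.hashCode(value);
  }
}
```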
>>
>> References:
>> - [BEAM-5866] Fix `Row#equals`, https://github.com/apache/beam/pull/6845
>> - [BEAM-5646] Fix quality and hashcode for bytes in Row,
>> https://github.com/apache/beam/pull/6765
>>
>> Gleb
>>
>


Re: Java postcommits duration almost hit 4 hours

2018-10-12 Thread Anton Kedin
Not sure where other perf issues are coming from, but this specific BQ test
suite was disabled yesterday: https://github.com/apache/beam/pull/6658

On Fri, Oct 12, 2018 at 3:20 PM Kenneth Knowles  wrote:

> Nice catch. Here is a build that went from 2.5 to 3 hours:
> https://builds.apache.org/view/A-D/view/Beam/job/beam_PostCommit_Java_GradleBuild/1654/
> looks like it added some BQ tests. Not sure that can account for it.
>
> From there it was red for some time and slow once it went green again and
> was already slowed.
>
> I'd suggest that postcommit Dataflow integration tests have their own
> build that does not run anything else. That's something you can do now that
> we use Gradle that was not possible with mvn (except via Jenkins shell job).
>
> Kenn
>
> On Fri, Oct 12, 2018 at 11:33 AM Mikhail Gryzykhin 
> wrote:
>
>> Hi everyone,
>>
>> I just wanted to highlight an interesting fact: Our java postcommits
>> duration almost *doubled* since last week rising from 2.2 to nearly 4
>> hours.  (See bottom-left graph on this dashboard
>> )
>>
>> We might want to check on the tests we are adding lately.
>>
>> Regards,
>> --Mikhail
>>
>> Have feedback ?
>>
>


Re: Java compilation errors

2018-10-11 Thread Anton Kedin
It's being discussed on Slack at the moment; the issue seems to be the new
errorprone version, which has new checks.

On Thu, Oct 11, 2018 at 10:23 AM Mikhail Gryzykhin 
wrote:

> Hi everyone,
>
> Just a heads up:
>
> I see that Java builds have compilation failures. Can someone help look
> into this please?
>
> Unfortunately, I don't have time to look into it myself atm.
>
> Regards,
> --Mikhail
>
> Have feedback ?
>


Re: [Proposal] Euphoria DSL - looking for reviewers

2018-10-10 Thread Anton Kedin
I think the code looks good and we should probably just merge it (unless
there are other blockers, e.g. formal approvals), considering:
 - it has been reviewed;
 - it is tested and used in production;
 - it was discussed on the list and there were no objections to having it
as part of Beam;
 - it is a standalone extension and doesn't interfere with Beam Java SDK,
if I didn't miss anything;
 - it has people working on it and supporting it;

All other issues can probably be sorted out in normal Beam process.

Regards,
Anton

On Wed, Oct 10, 2018 at 5:57 AM David Morávek 
wrote:

> Hello Max,
>
> It would be great if you can do more of a "general" review, the code base
> is fairly large, well tested and it was already reviewed internally by
> several people.
>
> We would like to have the overall approach and design decisions validated
> by the community and get some inputs on what could be improved and if we
> are headed the right direction.
>
> Thanks,
> David
>
> On Wed, Oct 10, 2018 at 2:21 PM Maximilian Michels  wrote:
>
>> That is a huge PR! :) Euphoria looks great. Especially for people coming
>> from Flink/Spark. I'll check out the documentation.
>>
>> Do you have any specific code parts which you want to have reviewed?
>>
>> Thanks,
>> Max
>>
>> On 10.10.18 10:30, Jean-Baptiste Onofré wrote:
>> > Hi,
>> >
>> > Thanks for all the work you are doing on this DSL !
>> >
>> > I tried to follow the features branch for a while. I'm still committed
>> > to moving forward on that front, but more reviewers would be great.
>> >
>> > Regards
>> > JB
>> >
>> > On 10/10/2018 10:26, Plajt, Vaclav wrote:
>> >> Hello Beam devs,
>> >> we finished our main goals in development of Euphoria DSL. It is Easy
>> to
>> >> use Java 8 API build on top of the Beam's Java SDK. API provides a
>> >> high-level abstraction of data transformations, with focus on the Java
>> 8
>> >> language features (e.g. lambdas and streams). It is fully
>> inter-operable
>> >> with existing Beam SDK and convertible back and forth. It allows fast
>> >> prototyping through use of (optional) Kryo based coders and can be
>> >> seamlessly integrated into existing Beam Pipelines.
>> >>
>> >> Now we believe that it is the time to start discussion about it with
>> the
>> >> community. Which will hopefully lead to vote about adapting it into
>> >> Apache Beam project. Most of main ideas and development goals were
>> >> presented in Beam Summit in London [1].
>> >>
>> >> We are looking for reviewers within the community. Please start with
>> >> documentation [2] or design document [3]. Our contribution is divided
>> to
>> >> two modules: `org.apache.beam:beam-sdks-java-extensions-euphoria` and
>> >> `org.apache.beam:beam-sdks-java-extensions-kryo`. Rest of the code base
>> >> remains untouched.
>> >> All the checks in MR [5] are passing with exception of "Website
>> >> PreCommit". Which seems to be broken, little help here would be
>> appreciated.
>> >>
>> >> Thank you
>> >> We are looking forward for your feedback.
>> >> {david.moravek,vaclav.plajt,marek.simunek}@firma.seznam.cz
>> >>
>> >> Resources:
>> >> [1] Beam Summit London presentation:
>> >>
>> https://docs.google.com/presentation/d/1SagpmzJ-tUQki5VsQOEEEUyi_LXRJdG_3OBLdjBKoh4/edit?usp=sharing
>> >> [2] Documentation:
>> >>
>> https://github.com/seznam/beam/blob/dsl-euphoria/website/src/documentation/sdks/euphoria.md
>> >> [3] Design Document: https://s.apache.org/beam-euphoria
>> >> [4] ASF Jira Issue: https://issues.apache.org/jira/browse/BEAM-3900
>> >> [5] Pull Request: https://github.com/apache/beam/pull/6601
>> >> [6] Original proposal:
>> >>
>> http://mail-archives.apache.org/mod_mbox/beam-dev/201712.mbox/%3ccajjqkhnrp1z8atteogmpfkqxrcjeanb3ykowvvtnwyrvv_-...@mail.gmail.com%3e
>> >>
>> >>
>> >>
>> >> You should know that this e-mail and its attachments are confidential.
>> >> If we are negotiating on the conclusion of a transaction, we reserve
>> the
>> >> right to terminate the negotiations at any time. For fans of
>> legalese—we
>> >> hereby exclude the provisions of the Civil Code on pre-contractual
>> >> liability. The rules about who and how may act for the company and what
>> >> are the signing procedures can be found here
>> >> .
>> >
>>
>


Re: Jira Integration with Github

2018-10-09 Thread Anton Kedin
Assuming this is a github-only plugin, why does it have to go through ASF?

On Tue, Oct 9, 2018 at 3:20 AM Maximilian Michels  wrote:

> Hi Kai,
>
> This needs to be supported by the ASF first. So the best idea would be
> to propose this to the INFRA team. Or post it on the ASF community
> mailing list.
>
> Best,
> Max
>
> On 09.10.18 06:37, Kai Jiang wrote:
> > Hi all,
> >
> > Github has announced the official support for Jira integraton
> >
> https://blog.github.com/2018-10-04-announcing-the-new-github-and-jira-software-cloud-integration/
> > .
> >
> > Is it possible to enable it for Apache Beam's Jira tickets? It could
> > help with automation of issue workflows in Jira.
> > Maybe we could open an Infra ticket to see if the integration works for
> us.
> >
> > Best,
> > Kai
>


Java SDK Extensions

2018-10-03 Thread Anton Kedin
Hi dev@,

*TL;DR:* `sdks/java/extensions` is hard to discover, navigate and
understand.

*Current State:*

I was looking at `sdks/java/extensions` [1] and realized that I don't know
what half of those things are. Only `join library` and `sorter` seem to be
documented and discoverable on the Beam website, under the SDKs section [2].

Here's the list of all extensions with my questions/comments:
 - *google-cloud-platform-core*. What is this? Is this used in GCP IOs? If
so, is `extensions` the right place for it? If it is, then why is it a
`-core` extension? It feels like it's a utility package, not an extension;
 - *jackson*. I can guess what it is but we should document it somewhere;
 - *join-library*. It is documented, but I think we should add more
documentation to explain how it works, maybe some caveats, and a link to/from
the `CoGBK` section of the docs (a short usage sketch follows below);
 - *protobuf*. I can probably guess what it is. Is 'extensions' the right
place for it though? We use this library in IOs (`PubsubIO.readProtos()`),
so should we move it to IO then? Same as with the GCP extension, it feels
like a utility library, not an extension;
 - *sketching*. No idea what to expect from this without reading the code;
 - *sorter*. Documented on the website;
 - *sql*. This looks familiar :) It is documented but not linked from the
extensions section, and it's unclear whether it covers the whole SQL module
or just some related components;

[1]: https://github.com/apache/beam/tree/master/sdks/java/extensions
[2]: https://beam.apache.org/documentation/sdks/java-extensions/
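
To make the join-library point above concrete, here is a minimal sketch of the
kind of usage example such documentation could show. It is hedged: it assumes
`Join.innerJoin` from `org.apache.beam.sdk.extensions.joinlibrary` takes two
keyed PCollections and pairs the values under the shared key; the class and
sample data here are made up for illustration.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.joinlibrary.Join;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class JoinLibraryExample {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Two keyed collections: userId -> name and userId -> country.
    PCollection<KV<Integer, String>> names =
        pipeline.apply("Names", Create.of(KV.of(1, "alice"), KV.of(2, "bob")));
    PCollection<KV<Integer, String>> countries =
        pipeline.apply("Countries", Create.of(KV.of(1, "us"), KV.of(2, "cz")));

    // Inner join by key; each output element is KV(key, KV(leftValue, rightValue)).
    PCollection<KV<Integer, KV<String, String>>> joined = Join.innerJoin(names, countries);

    pipeline.run().waitUntilFinish();
  }
}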

*Questions:*

 - should we minimally document (at least describe) all extensions and add
at least short readme.md's with the links to the Beam website?
 - is it the right thing for other components, like IOs, to depend on
`extensions`?
 - would it make sense to move some things out of 'extensions'? E.g. IO
components into IO or a utility package, and SQL into a new DSLs package;

*Opinion:*

Maybe I am misunderstanding the intent and meaning of 'extensions', but
from my perspective:

 - I think that extensions should be more or less isolated from the Beam
SDK itself, so that if you delete or modify them, no Beam-internal changes
will be required (changes to something that's not an extension). And my
feeling is that they should provide value by themselves to users other than
SDK authors. They are called 'extensions', not 'critical components' or
'sdk utilities';

 - I don't think that IOs should depend on 'extensions'. Otherwise the
question is, is it ok for other components, like runners, to do the same?
Or even core?

 - I think there are a few distinguishable classes of things in 'extensions'
right now:
   - collections of `PTransforms` with some business logic (Sorter, Join,
Sketch);
   - collections of `PTransforms` with a focus on parsing (Jackson, Protobuf);
   - DSLs; the SQL DSL has more than just a few `PTransforms` and can be
used almost as a standalone SDK. Things like Euphoria will probably end up
in the same class;
   - utility libraries shared by some parts of the SDK, where it is unclear
whether they are valuable by themselves to external users (Protobuf, GCP core);
   To me, the business-logic and parsing libraries make sense to stay in
extensions, but probably under different subdirectories. I think it would
make sense to split the others out of extensions into separate parts of the
SDK.

 - I think we should add readme.md's with short descriptions and links to
the Beam website;

Thoughts, comments?


[1]: https://github.com/apache/beam/tree/master/sdks/java/extensions
[2]: https://beam.apache.org/documentation/sdks/java-extensions/


Re: [DISCUSS] Committer Guidelines / Hygene before merging PRs

2018-09-28 Thread Anton Kedin
Is there an actual problem caused by squashing or not squashing the commits
that we face in the project? I personally have never needed to revert
something complicated that would be problematic either way (and don't have
a strong opinion about which way we should do it). From what I see so far
in the thread it doesn't look like reverting is a frequent major pain for
anyone. Maybe it is exactly because we're mostly following some best
practice and it makes it easy. If someone has concrete examples from their
experience in the project, please share them, this way it would be easier
to justify the choice.

The PR and commit cleanliness, size, and isolation are probably the more
important things to have guidance, and maybe enforcement, for. There are
well-known practices and guidelines that I think we should follow, and I
think they will make squashing or not squashing mostly irrelevant. For
example, if we accept that commits should have a description that actually
describes what the commit does, then "!fixup", "address comments", and
similar should not be part of the history and should be squashed before
submitting the PR, no matter which way we decide to go in general. Also, I
think that making commits isolated is a good practice, and the PR author
should be able to relatively easily split the PR upon a reviewer's request.
And if we choose to keep whole PRs small and incremental, with descriptive,
isolated commits, then there won't be much difference in how many commits
there are.

Regards,
Anton

On Fri, Sep 28, 2018 at 8:21 AM Andrew Pilloud  wrote:

> I brought up this discussion a few months ago from the other side: I don't
> like my commits being squashed. I try to create logical commits that each
> passes tests and could be broken up into multiple PRs. Keeping those
> changes intact is useful from a history perspective and squashing may break
> other PRs I have in flight. If the intent is clear (one commit with a
> descriptive message and a bunch of "fixups"), then feel free to squash,
> otherwise ask first. When you do squash, it would be good to leave a
> comment as to how the author can avoid having their commits squashed in the
> future.
>
>
> https://lists.apache.org/thread.html/8d29e474e681ab9123280164d95075bb8b0b91486b66d3fa25ed20c2@%3Cdev.beam.apache.org%3E
>
> Andrew
>
> On Fri, Sep 28, 2018 at 7:29 AM Chamikara Jayalath 
> wrote:
>
>>
>>
>> On Thu, Sep 27, 2018 at 9:51 AM Robert Bradshaw 
>> wrote:
>>
>>> I agree that we should create a good pointer for cleaning up PRs, and
>>> request (though not require) that authors do it. It's unfortunate though
>>> that squashing during a review makes things difficult to follow, so adds
>>> one more round trip.
>>>
>>> We could consider for those PRs that make sense as a single logical
>>> commit (most, but not all, of them) simply using the "squash and merge"
>>> button even though it technically doesn't create a merge commit.
>>>
>>
>> +1 for allowing "squash and merge" as an option. Most of the reviews (at
>> least for me) consist of a single valid commit and several additional
>> commits that get piled up during the review process which obviously should
>> not be included in the commit history. Going through another round here
>> just to ask the author to fixup everything is unnecessarily time consuming.
>>
>> - Cham
>>
>>
>>>
>>>
>>> On Fri, Sep 21, 2018 at 9:24 PM Daniel Oliveira 
>>> wrote:
>>>
 As a non-committer I think some automated squashing of commits sounds
 best since it lightens the load of regular contributors, by not having to
 always remember to squash, and lightens the load of committers so it
 doesn't take as long to have your PR approved by one.

 But for now I think the second best route would be making it PR
 author's responsibility to squash fixup commits. Having that expectation
 described clearly in the Contributor's Guide, along with some simple
 step-by-step instructions for how to do so should be enough. I mainly
 support this because I've been doing the squashing myself since I saw a
 thread about it here a few months ago. It's not nearly as huge a burden on
 me as it probably is for committers who have to merge in many more PRs,
 it's very easy to learn how to do, and it's one less barrier to having my
 code merged in.

 Of course I wouldn't expect that committers wait for PR authors to
 squash their fixup commits, but I think leaving a message like "For future
 pull requests you should squash any small fixup commits, as described here:
 " should be fine.


> I was also thinking about the possibility of wanting to revert
> individual commits from a merge commit. The solution you propose
> works,
> but only if you want to revert everything.


 Does this happen often? I might not have enough context since I'm not a
 committer, but it seems to me that often the person performing a revert is
 not the original author of a 

Re: [ANNOUNCEMENT] New Beam chair: Kenneth Knowles

2018-09-19 Thread Anton Kedin
Congrats!

On Wed, Sep 19, 2018 at 1:36 PM Ankur Goenka  wrote:

> Congrats Kenn!
>
> On Wed, Sep 19, 2018 at 1:35 PM Amit Sela  wrote:
>
>> Well deserved! Congrats Kenn.
>>
>> On Wed, Sep 19, 2018 at 4:25 PM Kai Jiang  wrote:
>>
>>> Congrats, Kenn!
>>>
>>> On Wed, Sep 19, 2018 at 1:23 PM Alan Myrvold 
>>> wrote:
>>>
 Congrats, Kenn.

 On Wed, Sep 19, 2018 at 1:08 PM Maximilian Michels 
 wrote:

> Congrats!
>
> On 19.09.18 22:07, Robin Qiu wrote:
> > Congratulations, Kenn!
> >
> > On Wed, Sep 19, 2018 at 1:05 PM Lukasz Cwik  > > wrote:
> >
> > Congrats Kenn.
> >
> > On Wed, Sep 19, 2018 at 12:54 PM Davor Bonaci  > > wrote:
> >
> > Hi everyone --
> > It is with great pleasure that I announce that at today's
> > meeting of the Foundation's Board of Directors, the Board has
> > appointed Kenneth Knowles as the second chair of the Apache
> Beam
> > project.
> >
> > Kenn has served on the PMC since its inception, and is very
> > active and effective in growing the community. His exemplary
> > posts have been cited in other projects. I'm super happy to
> have
> > Kenn accepted the nomination, and I'm confident that he'll
> serve
> > with distinction.
> >
> > As for myself, I'm not going anywhere. I'm still around and
> will
> > be as active as I have recently been. Thrilled to be able to
> > pass the baton to such a key member of this community and to
> > have less administrative work to do ;-).
> >
> > Please join me in welcoming Kenn to his new role, and I ask
> that
> > you support him as much as possible. As always, please let me
> > know if you have any questions.
> >
> > Davor
> >
>



Re: Migrating Beam SQL to Calcite's code generation

2018-09-17 Thread Anton Kedin
This is pretty amazing! Thank you for doing this!

Regards,
Anton

On Mon, Sep 17, 2018 at 2:27 PM Andrew Pilloud  wrote:

> I've adapted Calcite's EnumerableCalc code generation to generate the
> BeamCalc DoFn. The primary purpose behind this change is so we can take
> advantage of Calcite's extensive SQL operator implementation. This deletes
> ~11000 lines of code from Beam (with ~350 added), significantly increases
> the set of supported SQL operators, and improves performance and
> correctness of currently supported operators. Here is my work in progress:
> https://github.com/apache/beam/pull/6417
>
> There are a few bugs in Calcite that this has exposed:
>
> Fixed in Calcite master:
>
>    - CALCITE-2321 - The type of a union of CHAR columns of different
>      lengths should be VARCHAR
>    - CALCITE-2447 - Some POWER, ATAN2 functions fail with
>      NoSuchMethodException
>
> Pending PRs:
>
>    - CALCITE-2529 - linq4j should promote integer to floating point when
>      generating function calls
>    - CALCITE-2530 - TRIM function does not throw exception when the length
>      of trim character is not 1 (one)
>
> More work:
>
>    - CALCITE-2404 - Accessing structured-types is not implemented by the
>      runtime
>    - (none yet) - Support multi character TRIM extension in Calcite
>
> I would like to push these changes in with these minor regressions. Do any
> of these Calcite bugs block this functionality being added to Beam?
>
> Andrew
>


Re: [VOTE] Donating the Dataflow Worker code to Apache Beam

2018-09-14 Thread Anton Kedin
+1

On Fri, Sep 14, 2018 at 3:22 PM Alan Myrvold  wrote:

> +1
>
> On Fri, Sep 14, 2018 at 3:16 PM Boyuan Zhang  wrote:
>
>> +1
>>
>> On Fri, Sep 14, 2018 at 3:15 PM Henning Rohde  wrote:
>>
>>> +1
>>>
>>> On Fri, Sep 14, 2018 at 2:40 PM Ahmet Altay  wrote:
>>>
 +1 (binding)

 On Fri, Sep 14, 2018 at 2:35 PM, Lukasz Cwik  wrote:

> +1 (binding)
>
> On Fri, Sep 14, 2018 at 2:34 PM Pablo Estrada 
> wrote:
>
>> +1
>>
>> On Fri, Sep 14, 2018 at 2:32 PM Andrew Pilloud 
>> wrote:
>>
>>> +1
>>>
>>> On Fri, Sep 14, 2018 at 2:31 PM Lukasz Cwik 
>>> wrote:
>>>
 There was generally positive support and good feedback[1] but it
 was not unanimous. I wanted to bring the donation of the Dataflow 
 worker
 code base to Apache Beam master to a vote.

 +1: Support having the Dataflow worker code as part of Apache Beam
 master branch
 -1: Dataflow worker code should live elsewhere

 1:
 https://lists.apache.org/thread.html/89efd3bc1d30f3d43d4b361a5ee05bd52778c9dc3f43ac72354c2bd9@%3Cdev.beam.apache.org%3E

>>>



Re: [Discuss] Add EXTERNAL keyword to CREATE TABLE statement

2018-09-14 Thread Anton Kedin
Raising this topic once more. The PR [1] has been open for a while; if there
is no further input, I'm going to merge it by the end of the day.

[1]: https://github.com/apache/beam/pull/6252

Thank you,
Anton


On Wed, Aug 15, 2018 at 10:48 PM Tim  wrote:

> +1 for CREATE EXTERNAL TABLE with similar reasoning given by others on
> this thread.
>
> Tim
>
> On 15 Aug 2018, at 23:01, Charles Chen  wrote:
>
> +1 for CREATE EXTERNAL TABLE.  It is a good balance between the general
> SQL expectation of having tables as an abstraction and reinforcing that
> Beam does not store your data.
>
> On Wed, Aug 15, 2018 at 1:58 PM Rui Wang  wrote:
>
>> >  I think users will be more confused to find that 'CREATE TABLE'
>> doesn't exist then to learn that it might not always create a table.
>>
>> >> I think that having CREATE TABLE do something unexpected or not do
>> something expected (or do the opposite things depending on the table type
>> or some flag) is worse than having users look up the correct way of
>> creating a data source in Beam SQL without expecting something we don't
>> promise.
>>
>> I agree on this. Enforcing users to look up documentation for the correct
>> way is better than letting them use an ambiguous way that could fail their
>> expectation.
>>
>>
>> -Rui
>>
>> On Wed, Aug 15, 2018 at 1:46 PM Anton Kedin  wrote:
>>
>>> I think that something unique along the lines of `REGISTER EXTERNAL DATA
>>> SOURCE` is probably fine, as it doesn't conflict with existing behaviors of
>>> other dialects.
>>>
>>> > There is a lot of value in making sure our common operations closely
>>> map to the equivalent common operations in other SQL dialects.
>>>
>>> We're trying to make opposite points using the same arguments :) A lot
>>> of popular dialects make difference between CREATE TABLE and CREATE
>>> EXTERNAL TABLE (or similar):
>>>  - T-SQL:
>>>   create:
>>> https://docs.microsoft.com/en-us/sql/t-sql/statements/create-table-transact-sql
>>>   create external:
>>> https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-2017
>>>   external datasource:
>>> https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-data-source-transact-sql?view=sql-server-2017
>>>  - PL/SQL:
>>>   create:
>>> https://docs.oracle.com/cd/B28359_01/server.111/b28310/tables003.htm#i1106369
>>>   create external:
>>> https://docs.oracle.com/cd/B19306_01/server.102/b14215/et_concepts.htm#i1009127
>>>  - postgres:
>>>   import foreign schema:
>>> https://www.postgresql.org/docs/9.5/static/sql-importforeignschema.html
>>>   create table:
>>> https://www.postgresql.org/docs/9.1/static/sql-createtable.html
>>>  - redshift:
>>>   create external schema:
>>> https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_SCHEMA.html
>>>   create table:
>>> https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html
>>>  - hive internal and external:
>>> https://www.dezyre.com/hadoop-tutorial/apache-hive-tutorial-tables
>>>
>>> My understanding is that the behavior of create table is somewhat
>>> similar in all of the above dialects, from the high-level perspective it
>>> usually creates a persistent table in the current storage context
>>> (database). That's not what Beam SQL's create table does right now, and my
>>> opinion is that it should not be called create table for this reason.
>>>
>>> >  I think users will be more confused to find that 'CREATE TABLE'
>>> doesn't exist then to learn that it might not always create a table.
>>>
>>> I think that having CREATE TABLE do something unexpected or not do
>>> something expected (or do the opposite things depending on the table type
>>> or some flag) is worse than having users look up the correct way of
>>> creating a data source in Beam SQL without expecting something we don't
>>> promise.
>>>
>>> >  (For example, a user guessing at the syntax of CREATE TABLE would
>>> have a better experience with the error being "field LOCATION not
>>> specified" rather than "operation CREATE TABLE not found".)
>>>
>>> They have to look it up anyway (what format is location for a Pubsub
>>> topic? or is it a subscription?), and when doing so I think it would be
>>> less confusing to read that to get data from Pubsub/Kafka/... in 

Re: Nexmark pseudo code in the wiki

2018-08-17 Thread Anton Kedin
Thank you!
On Fri, Aug 17, 2018 at 9:44 AM Thomas Weise  wrote:

> Anton, you should be all set.
>
> On Fri, Aug 17, 2018 at 9:11 AM Anton Kedin  wrote:
>
>> Sure, I can do that.
>> Can someone give me permissions?
>>
>> Thank you,
>> Anton
>>
>> On Fri, Aug 17, 2018 at 12:32 AM Etienne Chauchot 
>> wrote:
>>
>>> Hi Anton,
>>>
>>> I was hoping you would say that. Actually I hesitated to add SQL-Nexmark
>>> and I thought you were more suited to describe it :)
>>>
>>> Thanks
>>> Etienne
>>>
>>> Le jeudi 16 août 2018 à 09:10 -0700, Anton Kedin a écrit :
>>>
>>> This is nice! Thank you for publishing this!
>>>
>>> The only thing I would add is the pseudo-SQL versions of the queries,
>>> similar to how they're described in the original Nexmark paper.
>>>
>>> Regards,
>>> Anton
>>>
>>> On Thu, Aug 16, 2018 at 5:57 AM Etienne Chauchot 
>>> wrote:
>>>
>>> Hi guys,
>>>
>>> I've also created a page on the contributors wiki for nexmark
>>> Indeed, some queries can be very complex. To ease their maintenance, I
>>> created a page that presents the architecture along with pseudo-code of the
>>> queries:
>>>
>>> https://cwiki.apache.org/confluence/display/BEAM/Nexmark+code
>>>
>>> Comment welcome !
>>>
>>> Etienne
>>>
>>>


Re: Nexmark pseudo code in the wiki

2018-08-17 Thread Anton Kedin
Sure, I can do that.
Can someone give me permissions?

Thank you,
Anton

On Fri, Aug 17, 2018 at 12:32 AM Etienne Chauchot 
wrote:

> Hi Anton,
>
> I was hoping you would say that. Actually I hesitated to add SQL-Nexmark
> and I thought you were more suited to describe it :)
>
> Thanks
> Etienne
>
> Le jeudi 16 août 2018 à 09:10 -0700, Anton Kedin a écrit :
>
> This is nice! Thank you for publishing this!
>
> The only thing I would add is the pseudo-SQL versions of the queries,
> similar to how they're described in the original Nexmark paper.
>
> Regards,
> Anton
>
> On Thu, Aug 16, 2018 at 5:57 AM Etienne Chauchot 
> wrote:
>
> Hi guys,
>
> I've also created a page on the contributors wiki for nexmark
> Indeed, some queries can be very complex. To ease their maintenance, I
> created a page that presents the architecture along with pseudo-code of the
> queries:
>
> https://cwiki.apache.org/confluence/display/BEAM/Nexmark+code
>
> Comment welcome !
>
> Etienne
>
>


Re: Nexmark pseudo code in the wiki

2018-08-16 Thread Anton Kedin
This is nice! Thank you for publishing this!

The only thing I would add is the pseudo-SQL versions of the queries,
similar to how they're described in the original Nexmark paper.

Regards,
Anton

On Thu, Aug 16, 2018 at 5:57 AM Etienne Chauchot 
wrote:

> Hi guys,
>
> I've also created a page on the contributors wiki for nexmark
> Indeed, some queries can be very complex. To ease their maintenance, I
> created a page that presents the architecture along with pseudo-code of the
> queries:
>
> https://cwiki.apache.org/confluence/display/BEAM/Nexmark+code
>
> Comment welcome !
>
> Etienne
>


Re: [Discuss] Add EXTERNAL keyword to CREATE TABLE statement

2018-08-15 Thread Anton Kedin
I think that something unique along the lines of `REGISTER EXTERNAL DATA
SOURCE` is probably fine, as it doesn't conflict with existing behaviors of
other dialects.

> There is a lot of value in making sure our common operations closely map
to the equivalent common operations in other SQL dialects.

We're trying to make opposite points using the same arguments :) A lot of
popular dialects distinguish between CREATE TABLE and CREATE EXTERNAL
TABLE (or similar):
 - T-SQL:
  create:
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-table-transact-sql
  create external:
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-2017
  external datasource:
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-data-source-transact-sql?view=sql-server-2017
 - PL/SQL:
  create:
https://docs.oracle.com/cd/B28359_01/server.111/b28310/tables003.htm#i1106369
  create external:
https://docs.oracle.com/cd/B19306_01/server.102/b14215/et_concepts.htm#i1009127
 - postgres:
  import foreign schema:
https://www.postgresql.org/docs/9.5/static/sql-importforeignschema.html
  create table:
https://www.postgresql.org/docs/9.1/static/sql-createtable.html
 - redshift:
  create external schema:
https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_SCHEMA.html
  create table:
https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html
 - hive internal and external:
https://www.dezyre.com/hadoop-tutorial/apache-hive-tutorial-tables

My understanding is that the behavior of create table is somewhat similar
in all of the above dialects: from a high-level perspective, it usually
creates a persistent table in the current storage context (database).
That's not what Beam SQL's create table does right now, and my opinion is
that it should not be called create table for this reason.

>  I think users will be more confused to find that 'CREATE TABLE' doesn't
exist then to learn that it might not always create a table.

I think that having CREATE TABLE do something unexpected or not do
something expected (or do the opposite things depending on the table type
or some flag) is worse than having users look up the correct way of
creating a data source in Beam SQL without expecting something we don't
promise.

>  (For example, a user guessing at the syntax of CREATE TABLE would have a
better experience with the error being "field LOCATION not specified"
rather than "operation CREATE TABLE not found".)

They have to look it up anyway (what is the format of LOCATION for a Pubsub
topic? or is it a subscription?), and when doing so I think it would be less
confusing to read that, to get data from Pubsub/Kafka/... in Beam SQL, you
have to do something like `REGISTER EXTERNAL DATA SOURCE` rather than `CREATE
TABLE`.

External tables and schemas don't have a standard approach, and I don't have
a strong preference for any one of the above options.

On Wed, Aug 15, 2018 at 1:08 PM Rui Wang  wrote:

> Adding dev@ back now.
>
> -Rui
>
> On Wed, Aug 15, 2018 at 1:01 PM Andrew Pilloud 
> wrote:
>
>> Did we drop the dev list from this on purpose? (I haven't added it back,
>> but we probably should.)
>>
>> I'm in favor of sticking with the simple 'CREATE TABLE' and 'CREATE
>> SCHEMA' if there is only to be one option. Sticking with those names
>> minimizes both our deviation from other implementations and user surprise.
>> There is a lot of value in making sure our common operations closely map to
>> the equivalent common operations in other SQL dialects. I think users will
>> be more confused to find that 'CREATE TABLE' doesn't exist then to learn
>> that it might not always create a table. This minimizes the overhead of
>> learning our dialect of SQL and maximizes the odds that a user will be able
>> to guess at the syntax of something and have it work. (For example, a user
>> guessing at the syntax of CREATE TABLE would have a better experience with
>> the error being "field LOCATION not specified" rather than "operation
>> CREATE TABLE not found".)
>>
>> If the goal is clarity of the operation, how about 'REGISTER EXTERNAL DATA
>> SOURCE' and 'REGISTER EXTERNAL DATA SOURCE PROVIDER'? Those names remove
>> the ambiguity around the operation creating and the data source being a
>> table.
>>
>> Andrew
>>
>> On Wed, Aug 15, 2018 at 10:54 AM Anton Kedin  wrote:
>>
>>> My preference is to make `EXTERNAL` mandatory and only support `CREATE
>>> EXTERNAL TABLE` for existing semantics. My main reasons are:
>>>  - user friendliness, matching expectations, readability. Current
>>> `CREATE TABLE` is basically a `CREATE EXTERNAL TABLE`. It is confusing to
>>> users familiar with SQL who expect

Re: How do we run pipeline using gradle?

2018-08-15 Thread Anton Kedin
Huygaa,

Not sure about existing options for WordCount specifically, but nothing
stops us from having it. In SQL we have a couple of tasks to simplify
launching the examples:

https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/build.gradle#L149

https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/example/BeamSqlPojoExample.java#L41

Something like this can be easily generalized and parametrized if we need
it.
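
As a rough sketch of what I mean by generalizing it (hedged: the class name
and the Gradle task are hypothetical; only the pipeline-options API shown is
the real one), a launcher is just a main() that takes the runner from the
regular pipeline options, so a parametrized JavaExec task only needs to
forward args such as --runner=DirectRunner or --runner=DataflowRunner:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

/** Hypothetical class a parametrized Gradle task could point its main class at. */
public class ExampleLauncher {
  public static void main(String[] args) {
    // The runner is selected via regular pipeline options (e.g. --runner=DirectRunner)
    // instead of a Maven profile; a Gradle JavaExec task just forwards these args.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline pipeline = Pipeline.create(options);
    pipeline.apply(Create.of("hello", "beam"));
    pipeline.run().waitUntilFinish();
  }
}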

Regards,
Anton


On Wed, Aug 15, 2018 at 10:54 AM Robin Qiu  wrote:

> Hey Huygaa,
>
> If you have your Intellij set up, you can run it from there and edit
> program arguments in "Run Configuration".
>
> Best,
> Robin
>
> On Wed, Aug 15, 2018 at 10:50 AM Huygaa Batsaikhan 
> wrote:
>
>> When we run wordcount using maven, we pass "-P dataflow-runner" profile
>> to set the runner. What is the equivalent of this in gradle? In other
>> words, how can I run wordcount straight from my Beam repo code?
>>
>


Re: Policy for Python ValidatesRunner vs IT tests?

2018-08-14 Thread Anton Kedin
IT tests exist in Java; they are similar to unit tests and are not marked in a
special way, except that they're named *IT.java instead of *Test.java. They're
run from the corresponding Gradle tasks:
 -
https://github.com/apache/beam/blob/d6c5bf977fc688f289f1bb06e30f25b05bf987b2/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubReadIT.java#L33

 -
https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/build.gradle#L85
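
For illustration, a minimal sketch of what such an integration test can look
like (hedged: the class and test are made up; it only assumes the usual JUnit 4
plus TestPipeline pattern used across the Java SDK):

import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.junit.Rule;
import org.junit.Test;

/** Named *IT.java so it is picked up by the integration-test task rather than by unit tests. */
public class ExamplePipelineIT {

  @Rule public final transient TestPipeline pipeline = TestPipeline.create();

  @Test
  public void testPipelineEndToEnd() {
    // A trivial pipeline stands in for reading from / writing to a real external service.
    PCollection<String> output = pipeline.apply(Create.of("a", "b", "c"));
    PAssert.that(output).containsInAnyOrder("a", "b", "c");
    pipeline.run().waitUntilFinish();
  }
}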

On Tue, Aug 14, 2018 at 3:27 PM Pablo Estrada  wrote:

> Hello,
> In Python, we tag some test methods with @attr('ValidatesRunner') and
> @attr('IT'), which marks them to be run as pipeline tests.
>
> If I understand correctly:
> - ValidatesRunner tests are more like a component tests[1] as explained in
> Beam docs
> - IT tests are more like a E2E test[2] as explained in the docs. Is there
> an equivalent in Java?
> - Finally, there's ValidatesContainer tests. What are these for? What's
> the guidance for tagging our tests this way?
>
> Thanks!
> -P.
>
> [1] https://beam.apache.org/contribute/testing/#validatesrunner
> [2] https://beam.apache.org/contribute/testing/#e2e
> --
> Got feedback? go/pabloem-feedback
> 
>


Re: [SQL] Create External Schema

2018-08-13 Thread Anton Kedin
It should be. I briefly looked at it and it seems possible to use
SchemaRegistry for this, although it will need some additional wiring for
named schemas.
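
Purely as a hypothetical sketch of the kind of wiring I mean (none of these
names exist today; `ExternalSchemaProvider` and `NamedSchemaRegistry` are made
up for illustration), the DDL layer could keep a map from the schema alias in
`CREATE FOREIGN SCHEMA ... AS bq` to a provider that resolves tables within it:

import java.util.HashMap;
import java.util.Map;

/** Hypothetical: resolves table names within one external schema (e.g. a BigQuery dataset). */
interface ExternalSchemaProvider {
  /** Returns some table handle for the given name; the concrete type is left open here. */
  Object getTable(String tableName);
}

/** Hypothetical registry keyed by the alias given in the CREATE FOREIGN SCHEMA statement. */
class NamedSchemaRegistry {
  private final Map<String, ExternalSchemaProvider> schemas = new HashMap<>();

  void register(String alias, ExternalSchemaProvider provider) {
    schemas.put(alias, provider);
  }

  /** Resolves a qualified name like "bq.table_example_bq". */
  Object resolve(String schemaAlias, String tableName) {
    ExternalSchemaProvider provider = schemas.get(schemaAlias);
    if (provider == null) {
      throw new IllegalArgumentException("Unknown schema: " + schemaAlias);
    }
    return provider.getTable(tableName);
  }
}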

Regards,
Anton


On Mon, Aug 13, 2018 at 4:14 PM Reuven Lax  wrote:

> Is it possible to extend Beam's SchemaRegistry to do this?
>
> On Mon, Aug 13, 2018 at 4:06 PM Anton Kedin  wrote:
>
>> Hi,
>>
>> I am planning to work on implementing a support for external schema
>> providers for Beam SQL and wanted to share a high level idea how I think
>> this can work.
>>
>> *Short Version*
>> Implement CREATE FOREIGN SCHEMA statement:
>>
>> CREATE FOREIGN SCHEMA
>>
>>  TYPE 'bigquery'
>>
>>  LOCATION 'dataset_example'
>>
>>  AS bq;
>>
>> CREATE FOREIGN SCHEMA
>>
>>  TYPE 'hcatalog'
>>
>>  LOCATION 'hive-server:2341'
>>
>>  AS hive;
>>
>> SELECT *
>>
>>  FROM
>>
>>   bq.table_example_bq AS bq_table1
>>
>> JOIN
>>
>>   hive.table_example_hive AS hive_table1
>>
>> ON
>>   bq_table1.some_field = hive_table1.some_other_field;
>>
>> *A Bit Longer Version: *
>> https://docs.google.com/document/d/1Ilk3OpDxrp3bHNlcnYDoj29tt9bd1E0EXt8i0WytNmQ
>>
>> Thoughts, ideas?
>>
>> Regards,
>> Anton
>>
>


[SQL] Create External Schema

2018-08-13 Thread Anton Kedin
Hi,

I am planning to work on implementing a support for external schema
providers for Beam SQL and wanted to share a high level idea how I think
this can work.

*Short Version*
Implement CREATE FOREIGN SCHEMA statement:

CREATE FOREIGN SCHEMA
  TYPE 'bigquery'
  LOCATION 'dataset_example'
  AS bq;

CREATE FOREIGN SCHEMA
  TYPE 'hcatalog'
  LOCATION 'hive-server:2341'
  AS hive;

SELECT *
FROM
  bq.table_example_bq AS bq_table1
JOIN
  hive.table_example_hive AS hive_table1
ON
  bq_table1.some_field = hive_table1.some_other_field;

*A Bit Longer Version: *
https://docs.google.com/document/d/1Ilk3OpDxrp3bHNlcnYDoj29tt9bd1E0EXt8i0WytNmQ

Thoughts, ideas?

Regards,
Anton


Re: Schema Aware PCollections

2018-08-08 Thread Anton Kedin
Yes, this should be possible eventually. In fact, a limited version of this
functionality is already supported for Beans (e.g. see this test), but it's
still experimental and there are no good end-to-end examples yet.

Regards,
Anton

On Wed, Aug 8, 2018 at 5:45 AM Akanksha Sharma B <
akanksha.b.sha...@ericsson.com> wrote:

> Hi,
>
>
> (changed the email-subject to make it generic)
>
>
> It is mentioned in Schema-Aware PCollections design doc (
> https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc
> )
>
>
> "There are a number of existing data types from which schemas can be
> inferred. Protocol buffers, Avro objects, Json objects, POJOs, primitive
> Java types - all of these have schemas that can be inferred from the type
> itself at pipeline-construction time. We should be able to automatically
> infer these schemas with a minimum of involvement from the programmer. "
>
> Can I assume that the following usecase will be possible sometime in
> future :-
> "read parquet (along with inferred schema) into something like dataframe
> or Beam Rows. And vice versa for write i.e. get rows and write parquet
> based on Row's schema.""
>
> Regards,
> Akanksha
>
> --
> *From:* Chamikara Jayalath 
> *Sent:* Wednesday, August 1, 2018 3:57 PM
> *To:* u...@beam.apache.org
> *Cc:* dev@beam.apache.org
> *Subject:* Re: pipeline with parquet and sql
>
>
>
> On Wed, Aug 1, 2018 at 1:12 AM Akanksha Sharma B <
> akanksha.b.sha...@ericsson.com> wrote:
>
> Hi,
>
>
> Thanks. I understood the Parquet point. I will wait for couple of days on
> this topic. Even if this scenario cannot be achieved now, any design
> document or future plans towards this direction will also be helpful to me.
>
>
> To summarize, I do not understand beam well enough, can someone please
> help me and comment whether the following fits with beam's model and
> future direction :-
>
> "read parquet (along with inferred schema) into something like dataframe
> or Beam Rows. And vice versa for write i.e. get rows and write parquet
> based on Row's schema."
>
>
> Beam currently does not have a standard message format. A Beam pipeline
> consists of PCollections and transforms (that converts PCollections to
> other PCollections). You can transform the PCollection read from Parquet
> using a ParDo and writing the resulting transform back to Parquet format. I
> think Schema aware PCollections [1] might be close to what you need but not
> sure if it fulfills your exact requirement.
>
> Thanks,
> Cham
>
> [1]
> https://lists.apache.org/thread.html/fe327866c6c81b7e55af28f81cedd9b2e588279def330940e8b8ebd7@%3Cdev.beam.apache.org%3E
>
>
>
>
>
> Regards,
>
> Akanksha
>
>
> --
> *From:* Łukasz Gajowy 
> *Sent:* Tuesday, July 31, 2018 12:43:32 PM
> *To:* u...@beam.apache.org
> *Cc:* dev@beam.apache.org
> *Subject:* Re: pipeline with parquet and sql
>
> In terms of schema and ParquetIO source/sink, there was an answer in some
> previous thread:
>
> Currently (without introducing any change in ParquetIO) there is no way to
> not pass the avro schema. It will probably be replaced with Beam's schema
> in the future ()
>
> [1]
> https://lists.apache.org/thread.html/a466ddeb55e47fd780be3bcd8eec9d6b6eaf1dfd566ae5278b5fb9e8@%3Cuser.beam.apache.org%3E
>
>
> wt., 31 lip 2018 o 10:19 Akanksha Sharma B 
> napisał(a):
>
> Hi,
>
>
> I am hoping to get some hints/pointers from the experts here.
>
> I hope the scenario described below was understandable. I hope it is a
> valid use-case. Please let me know if I need to explain the scenario
> better.
>
>
> Regards,
>
> Akanksha
>
> --
> *From:* Akanksha Sharma B
> *Sent:* Friday, July 27, 2018 9:44 AM
> *To:* dev@beam.apache.org
> *Subject:* Re: pipeline with parquet and sql
>
>
> Hi,
>
>
> Please consider following pipeline:-
>
>
> Source is Parquet file, having hundreds of columns.
>
> Sink is Parquet. Multiple output parquet files are generated after
> applying some sql joins. Sql joins to be applied differ for each output
> parquet file. Lets assume we have a sql queries generator or some
> configuration file with the needed info.
>
>
> Can this be implemented generically, such that there is no need of the
> schema of the parquet files involved or any intermediate POJO or beam
> schema.
>
> i.e. the way spark can handle it - read parquet into dataframe, create
> temp view and apply sql queries to it, and write it back to parquet.
>
> As I understand, beam SQL needs (Beam Schema or POJOs) and parquetIO needs
> avro schemas. Ideally we dont want to see POJOs or schemas.
> If there is a way we can achieve this with beam, please do help.
>
> Regards,
> Akanksha
>
> --
> *From:* Akanksha Sharma B
> *Sent:* Tuesday, July 24, 2018 4:47:25 PM
> *To:* 

Re: [Vote] Dev wiki engine

2018-07-19 Thread Anton Kedin
+1 for Confluence


On Thu, Jul 19, 2018 at 2:56 PM Andrew Pilloud  wrote:

> +1 Apache Confluence
>
> Because .md files in code repo require code review and commit.
>
> On Thu, Jul 19, 2018, 2:22 PM Mikhail Gryzykhin  wrote:
>
>> Hi everyone,
>>
>> There is a long lasting discussion on starting Beam Dev Wiki
>> 
>> ongoing. Seems that the only question remaining is to decide on what engine
>> to use for wiki. So far it seems that we have two suggestions: confluence
>> and .md files in repo.
>>
>> Quick summary can also be found in following doc
>> 
>> .
>>
>> I suggest to vote on which approach to use:
>> 1. Apache Confluence
>> 2. .md files in code repository (Those can be rendered by Github)
>>
>> --Mikhail
>>
>>


Re: [ANNOUNCEMENT] Nexmark included to the CI

2018-07-18 Thread Anton Kedin
These dashboards look great!

Can we publish the links to the dashboards somewhere for better visibility?
E.g. on the Jenkins website, in emails, or on the wiki.

Regards,
Anton

On Wed, Jul 18, 2018 at 10:08 AM Andrew Pilloud  wrote:

> Hi Etienne,
>
> I've been asking around and it sounds like we should be able to get a
> dedicated Jenkins node for performance tests. Another thing that might help
> is making the runs a few times longer. They are currently running around 2
> seconds each, so the total time of the build probably exceeds testing.
> Internally at Google we are running them with 2000x as many events on
> Dataflow, but a job of that size won't even complete on the Direct Runner.
>
> I didn't see the query 3 issues, but now that you point it out it looks
> like a bug to me too.
>
> Andrew
>
> On Wed, Jul 18, 2018 at 1:13 AM Etienne Chauchot 
> wrote:
>
>> Hi Andrew,
>>
>> Yes I saw that, except dedicating jenkins nodes to nexmark, I see no
>> other way.
>>
>> Also, did you see query 3 output size on direct runner? Should be a
>> straight line and it is not, I'm wondering if there is a problem with sate
>> and timers impl in direct runner.
>>
>> Etienne
>>
>> Le mardi 17 juillet 2018 à 11:38 -0700, Andrew Pilloud a écrit :
>>
>> I'm noticing the graphs are really noisy. It looks like we are running
>> these on shared Jenkins executors, so our perf tests are fighting with
>> other builds for CPU. I've opened an issue
>> https://issues.apache.org/jira/browse/BEAM-4804 and am wondering if
>> anyone knows an easy fix to isolate these jobs.
>>
>> Andrew
>>
>> On Fri, Jul 13, 2018 at 2:39 AM Łukasz Gajowy  wrote:
>>
>> @Etienne: Nice to see the graphs! :)
>>
>> @Ismael: Good idea, there's no document yet. I think we could create a
>> small google doc with instructions on how to do this.
>>
>> pt., 13 lip 2018 o 10:46 Etienne Chauchot 
>> napisał(a):
>>
>> Hi,
>>
>> @Andrew, this is because I did not find a way to set 2 scales on the Y
>> axis on the perfkit graphs. Indeed numResults varies from 1 to 100 000 and
>> runtimeSec is usually below 10s.
>>
>> Etienne
>>
>> Le jeudi 12 juillet 2018 à 12:04 -0700, Andrew Pilloud a écrit :
>>
>> This is great, should make performance work much easier! I'm going to get
>> the Beam SQL Nexmark jobs publishing as well. (Opened
>> https://issues.apache.org/jira/browse/BEAM-4774 to track.) I might take
>> on the Dataflow runner as well if no one else volunteers.
>>
>> I am curious as to why you have two separate graphs for runtime and count
>> rather than graphing runtime/count to get the throughput rate for each run?
>> Or should that be a third graph? Looks like it would just be a small tweak
>> to the query in perfkit.
>>
>>
>>
>> Andrew
>>
>> On Thu, Jul 12, 2018 at 11:40 AM Pablo Estrada 
>> wrote:
>>
>> This is really cool Etienne : ) thanks for working on this.
>> Our of curiosity, do you know how often the tests run on each runner?
>>
>> Best
>> -P.
>>
>> On Thu, Jul 12, 2018 at 2:15 AM Romain Manni-Bucau 
>> wrote:
>>
>> Awesome Etienne, this is really important for the (user) community to
>> have that visibility since it is one of the most important aspect of the
>> Beam's quality, kudo!
>>
>>
>> Romain Manni-Bucau
>> @rmannibucau  |  Blog
>>  | Old Blog
>>  | Github
>>  | LinkedIn
>>  | Book
>> 
>>
>>
>> Le jeu. 12 juil. 2018 à 10:59, Jean-Baptiste Onofré  a
>> écrit :
>>
>> It's really great to have these dashboards and integration in Jenkins !
>>
>> Thanks Etienne for driving this !
>>
>> Regards
>> JB
>>
>> On 11/07/2018 15:13, Etienne Chauchot wrote:
>> >
>> > Hi guys,
>> >
>> > I'm glad to announce that the CI of Beam has much improved ! Indeed
>> > Nexmark is now included in the perfkit dashboards.
>> >
>> > At each commit on master, nexmark suites are run and plots are created
>> > on the graphs.
>> >
>> > I've created 2 kind of dashboards:
>> > - one for performances (run times of the queries)
>> > - one for the size of the output PCollection (which should be constant)
>> >
>> > There are dashboards for these runners:
>> > - spark
>> > - flink
>> > - direct runner
>> >
>> > Each dashboard contains:
>> > - graphs in batch mode
>> > - graphs in streaming mode
>> > - graphs for the 13 queries.
>> >
>> > That gives more than a hundred graphs (my right finger hurts after so
>> > many clics on the mouse :) ). It is detailed that much so that anyone
>> > can focus on the area they have interest in.
>> > Feel free to also create new dashboards with more aggregated data.
>> >
>> > Thanks to Lukasz and Cham for reviewing my PRs and showing how to use
>> > perfkit dashboards.
>> >
>> > Dashboards are there:
>> >
>> >
>> https://apache-beam-testing.appspot.com/explore?dashboard=5084698770407424
>> >
>> 

Re: Permissions for confluence

2018-07-13 Thread Anton Kedin
I may be mistaken, but there was no final conclusion reached, so guidance
from the PMC will probably be needed on where specifically to put things. I
personally think that this kind of documentation is the right thing to put
under cwiki/contributors.

From the last thread I think Kenn, Jean-Baptiste (j...@nanthrax.net), and
Daniel (dk...@apache.org) had permissions.

On Fri, Jul 13, 2018 at 4:28 PM Mikhail Gryzykhin  wrote:

> Hi everyone,
>
> I believe that last month we decided to keep developers documentation on
> Apache confluence wiki (https://cwiki.apache.org/confluence/display/BEAM).
>
> I have some documentation regarding spinning up Jenkins via Docker.
>
> Can you help me with following:
> 1) Is confluence a correct place to put this documentation?
> 2) How do I get edit permissions for confluence?
>
> Regards,
> --Mikhail
>
> Have feedback ?
>


Re: Automatically create JIRA tickets for failing post-commit tests

2018-07-11 Thread Anton Kedin
I think this looks good; we should enable the plugin and try it out.
The concrete details of the follow-up tasks (auto-assignment, triage, and
dashboarding) will probably depend on how functional the plugin is and what
the test-failure data looks like.

Regards,
Anton

On Wed, Jul 11, 2018 at 5:00 PM Mikhail Gryzykhin  wrote:

> @Yifan Zou 
>
> I believe that we should test-drive the system with tickets + PR first and
> decide on email notification later. We already have tests failure emails
> sent to commits@, I believe most people filter out or not signed up for
> that list though.
>
> It creates only one ticket, and keeps it for recurring test failures.
>
> @Andrew Pilloud 
> Thank you for the suggestion. I'll add it to design doc.
>
> --Mikhail
>
>
>
> On Wed, Jul 11, 2018 at 4:52 PM Yifan Zou  wrote:
>
>> +1 to Andrew's concerns. Leaving the tickets unassigned will cause the
>> ticket being ignored and no actions being taken.
>>
>> I can see the challenges on ticket assignment. Like Mikhail mentioned,
>> the plugin does not support dynamic assignments. We have to implement
>> custom script to determine the assignees and do some tricks to the jenkins
>> job. Also, the post-commits tests usually cover tons of stuffs that it is
>> difficult to find which part was broken and ask the right person to look
>> into within the Auto JIRA process. Some naive thoughts: Are we able to send
>> emails to the dev@ to ask people to take care of the JIRA issues? Are we
>> able to find component leads and ask them triage the test failure tickets?
>>
>> Another nitpick comment. Does the jenkins job file the JIRA issue in
>> every test failure? Sometimes the test continuously fails in a time period
>> due to the same reason. In this case, we will get some duplicate issues
>> filed by Jenkins. I think it could be better if we can avoid filing issues
>> if the previous one has not been resolved.
>>
>> Thanks.
>> Yifan
>>
>>
>> On Wed, Jul 11, 2018 at 4:37 PM Andrew Pilloud 
>> wrote:
>>
>>> That sounds great. You should add this detail to the doc.
>>>
>>> On Wed, Jul 11, 2018 at 4:29 PM Mikhail Gryzykhin 
>>> wrote:
>>>
 We already have component for this purpose: "test-failures". All
 tickets created will go to that component. As an option, we can add link to
 view list of open JIRA tickets to PR template.

 We also would want to create graph on dashboard with amount of
 unassigned and assigned bugs.

 I believe that we can also add counter of unassigned bugs to PR
 template. This way it will be easier for everyone to know when there's some
 tests issue not attended.

 --Mikhail


 On Wed, Jul 11, 2018 at 4:24 PM Andrew Pilloud 
 wrote:

> So it sounds like you will want to create a component for untriaged
> issues so they are easy to find. I like the idea of distributing the work
> of triaging post commit failures to new PR authors as a condition of
> merging. I feel like we will just be filling JIRA with spam if the issues
> are automatically created without a plan for triage.
>
> Andrew
>
> On Wed, Jul 11, 2018 at 4:12 PM Rui Wang  wrote:
>
>> Maybe this is also a good thread to start the discussion that if we
>> want to enforce postcommit test for every PR.
>>
>> Can we afford the cost of longer waiting time to catch potential
>> bugs?
>>
>> -Rui
>>
>> On Wed, Jul 11, 2018 at 4:04 PM Mikhail Gryzykhin 
>> wrote:
>>
>>> That's a valid point.
>>>
>>> Unfortunately, the JiraTestResultReporter plugin does not have
>>> features to dynamically assign owners. Additionally, I don't think it is
>>> always easy to find proper owner for post-commit tests at first glance,
>>> since they usually cover broad specter of issues.
>>>
>>> My assumption is that we need someone to triage new issues.
>>>
>>> Ideally, any contributor, who sees failing test, should check
>>> unassigned tickets and either do triage, or assign them to someone who 
>>> can.
>>> I strongly encourage this approach.
>>>
>>> We have couple other ready-made options to consider:
>>> 1. We can configure JIRA component owner who would be assigned to
>>> created tickets.
>>> 2. JiraTestReporterPlugin can assign tickets to specific user. This
>>> is configured per Jenkins job. We can utilize this if someone 
>>> volunteers.
>>> 3. Dynamic assignment will most likely require custom solution.
>>>
>>> --Mikhail
>>>
>>>
>>> On Wed, Jul 11, 2018 at 3:34 PM Andrew Pilloud 
>>> wrote:
>>>
 Hi Mikhail,

 I like the proposal! Hopefully this can replace the constant stream
 of build failure emails. I noticed one detail seems to be missing:  How
 will new issues be assigned to the proper owner? Will the tool do this
 automatically or will we need someone to triage new issues?

Re: Building and visualizing the Beam SQL graph

2018-06-13 Thread Anton Kedin
From the visualization perspective I really loved the interactive runner
demo where it shows the graph:
https://www.youtube.com/watch?v=c5CjA1e3Cqw=27s

On Wed, Jun 13, 2018 at 4:36 PM Kenneth Knowles  wrote:

> Another thing to consider is that we might return something like a
> "SqlPCollection" that is the PCollection plus additional metadata that
> is useful to the shell / enumerable converter (such as if the PCollection
> has a known finite size due to LIMIT, even if it is "unbounded", and the
> shell can return control to the user once it receives enough rows). After
> your proposed change this will be much more natural to do, so that's
> another point in favor of the refactor.
>
> Kenn
>
> On Wed, Jun 13, 2018 at 10:22 AM Andrew Pilloud 
> wrote:
>
>> One of my goals is to make the graph easier to read and map back to the
>> SQL EXPLAIN output. The way the graph is currently built (`toPTransform` vs
>> `toPCollection`) does make a big difference in that graph. I think it is
>> also important to have a common function to do the apply with consistent
>> naming. I think that will greatly help with ease of understanding. It
>> sounds like what really want is this in the BeamRelNode interface:
>>
>> PInput buildPInput(Pipeline pipeline);
>> PTransform> buildPTransform();
>>
>> default PCollection toPCollection(Pipeline pipeline) {
>> return buildPInput(pipeline).apply(getStageName(), buildPTransform());
>> }
>>
>> Andrew
>>
>> On Mon, Jun 11, 2018 at 2:27 PM Mingmin Xu  wrote:
>>
>>> EXPLAIN shows the execution plan in SQL perspective only. After
>>> converting to a Beam composite PTransform, there're more steps underneath,
>>> each Runner re-org Beam PTransforms again which makes the final pipeline
>>> hard to read. In SQL module itself, I don't see any difference between
>>> `toPTransform` and `toPCollection`. We could have an easy-to-understand
>>> step name when converting RelNodes, but Runners show the graph to
>>> developers.
>>>
>>> Mingmin
>>>
>>> On Mon, Jun 11, 2018 at 2:06 PM, Andrew Pilloud 
>>> wrote:
>>>
>>>> That sounds correct. And because each rel node might have a different
>>>> input there isn't a standard interface (like PTransform<
>>>> PCollection, PCollection> toPTransform());
>>>>
>>>> Andrew
>>>>
>>>> On Mon, Jun 11, 2018 at 1:31 PM Kenneth Knowles  wrote:
>>>>
>>>>> Agree with that. It will be kind of tricky to generalize. I think
>>>>> there are some criteria in this case that might apply in other cases:
>>>>>
>>>>> 1. Each rel node (or construct of a DSL) should have a PTransform for
>>>>> how it computes its result from its inputs.
>>>>> 2. The inputs to that PTransform should actually be the inputs to the
>>>>> rel node!
>>>>>
>>>>> So I tried to improve #1 but I probably made #2 worse.
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Mon, Jun 11, 2018 at 12:53 PM Anton Kedin  wrote:
>>>>>
>>>>>> Not answering the original question, but doesn't "explain" satisfy
>>>>>> the SQL use case?
>>>>>>
>>>>>> Going forward we probably want to solve this in a more general way.
>>>>>> We have at least 3 ways to represent the pipeline:
>>>>>>  - how runner executes it;
>>>>>>  - what it looks like when constructed;
>>>>>>  - what the user was describing in DSL;
>>>>>> And there will probably be more, if extra layers are built on top of
>>>>>> DSLs.
>>>>>>
>>>>>> If possible, we probably should be able to map any level of
>>>>>> abstraction to any other to better understand and debug the pipelines.
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 11, 2018 at 12:17 PM Kenneth Knowles 
>>>>>> wrote:
>>>>>>
>>>>>>> In other words, revert
>>>>>>> https://github.com/apache/beam/pull/4705/files, at least in spirit?
>>>>>>> I agree :-)
>>>>>>>
>>>>>>> Kenn
>>>>>>>
>>>>>>> On Mon, Jun 11, 2018 at 11:39 AM Andrew Pilloud 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> We are currently converting the Calcite Rel tree to Beam by
>>>>>>>> recursively building a tree of nested PTransforms. This results in a 
>>>>>>>> weird
>>>>>>>> nested graph in the dataflow UI where each node contains its inputs 
>>>>>>>> nested
>>>>>>>> inside of it. I'm going to change the internal data structure for
>>>>>>>> converting the tree from a PTransform to a PCollection, which will 
>>>>>>>> result
>>>>>>>> in a more accurate representation of the tree structure being built and
>>>>>>>> should simplify the code as well. This will not change the public 
>>>>>>>> interface
>>>>>>>> to SQL, which will remain a PTransform. Any thoughts or objections?
>>>>>>>>
>>>>>>>> I was also wondering if there are tools for visualizing the Beam
>>>>>>>> graph aside from the dataflow runner UI. What other tools exist?
>>>>>>>>
>>>>>>>> Andrew
>>>>>>>>
>>>>>>>
>>>
>>>
>>> --
>>> 
>>> Mingmin
>>>
>>


Re: Building and visualizing the Beam SQL graph

2018-06-11 Thread Anton Kedin
Not answering the original question, but doesn't "explain" satisfy the SQL
use case?

Going forward we probably want to solve this in a more general way. We have
at least 3 ways to represent the pipeline:
 - how the runner executes it;
 - what it looks like when constructed;
 - what the user was describing in the DSL;
And there will probably be more, if extra layers are built on top of DSLs.

If possible, we probably should be able to map any level of abstraction to
any other to better understand and debug the pipelines.


On Mon, Jun 11, 2018 at 12:17 PM Kenneth Knowles  wrote:

> In other words, revert https://github.com/apache/beam/pull/4705/files, at
> least in spirit? I agree :-)
>
> Kenn
>
> On Mon, Jun 11, 2018 at 11:39 AM Andrew Pilloud 
> wrote:
>
>> We are currently converting the Calcite Rel tree to Beam by recursively
>> building a tree of nested PTransforms. This results in a weird nested graph
>> in the dataflow UI where each node contains its inputs nested inside of it.
>> I'm going to change the internal data structure for converting the tree
>> from a PTransform to a PCollection, which will result in a more accurate
>> representation of the tree structure being built and should simplify the
>> code as well. This will not change the public interface to SQL, which will
>> remain a PTransform. Any thoughts or objections?
>>
>> I was also wondering if there are tools for visualizing the Beam graph
>> aside from the dataflow runner UI. What other tools exist?
>>
>> Andrew
>>
>


Re: [DISCUSS] Use Confluence wiki for non-user-facing stuff

2018-06-08 Thread Anton Kedin
+1

(a) we should;
(b) I think it will be a good place for all of the things you list;
(c) introductory things, like "getting started", or "programming guide"
that people not deeply involved in the project would expect to find on
beam.apache.org should stay there, not in the wiki;

On Fri, Jun 8, 2018 at 12:56 AM Etienne Chauchot 
wrote:

> Hi Kenn,
> I'm +1 of course. I remember that you and I and others discussed in a
> similar thread about dev facing docs but it got lost at some point in time.
> IMHO
>
> I would add
> - runners specifics (e.g. how runners implement state or timer, how they
> split data into bundles, etc...)
> - probably putting online the doc I did for nexmark that summarizes the
> architecture and pseudo code of the queries (because some are several
> thousand lines of code). I often use it to recall what a given query does.
>
> I would remove
>  - BIPs / summaries of collections of JIRA
> because it is hard to maintain and will end up being out of date I think.
>
> Etienne
>
> Le jeudi 07 juin 2018 à 13:23 -0700, Kenneth Knowles a écrit :
>
> Hi all,
>
> I've been in half a dozen conversations recently about whether to have a
> wiki and what to use it for. Some things I've heard:
>
>  - "why is all this stuff that users don't care about here?"
>  - "can we have a lighter weight place to put technical references for
> contributors"
>
> So I want to consider as a community starting up our wiki. Ideas for what
> could go there:
>
>  - Collection of links to design docs like
> https://beam.apache.org/contribute/design-documents/
>  - Specialized walkthroughs like
> https://beam.apache.org/contribute/docker-images/
>  - Best-effort notes that just try to help out like
> https://beam.apache.org/contribute/intellij/
>  - Docs on in-progress stuff like
> https://beam.apache.org/documentation/runners/jstorm/
>  - Expanded instructions for committers, more than
> https://beam.apache.org/contribute/committer-guide/
>  - BIPs / summaries of collections of JIRA
>  - Docs sitting in markdown in the repo like
> https://github.com/apache/beam/blob/master/sdks/CONTAINERS.md and
> https://github.com/apache/beam-site/blob/asf-site/README.md (which will
> soon not be a toplevel README)
>
> What do you think?
>
> (a) should we do it?
> (b) what should go there?
> (c) what should not go there?
>
> Kenn
>
>


Re: [SQL] Unsupported features

2018-06-01 Thread Anton Kedin
This looks very helpful, thank you.

Can you file JIRAs for the major problems? Or maybe a single JIRA for the
whole thing with sub-tasks for the specific problems.

Regards,
Anton

On Wed, May 30, 2018 at 9:12 AM Kenneth Knowles  wrote:

> This is extremely useful. Thanks for putting so much information together!
>
> Kenn
>
> On Wed, May 30, 2018 at 8:19 AM Kai Jiang  wrote:
>
>> Hi all,
>>
>> Based on pull/5481 , I
>> manually did a coverage test with TPC-ds queries (65%) and TPC-h queries
>> (100%) and want to see what features Beam SQL is currently not supporting.
>> Test was running on DirectRunner.
>>
>> I want to share the result.​
>>  TPC-DS queries on Beam
>> 
>> ​
>> TL;DR:
>>
>>1. aggregation function (stddev) missing or calculation of
>>aggregation functions combination.
>>2. nested beamjoinrel(condition=[true], joinType=[inner]) / cross
>>join error
>>3. date type casting/ calculation and other types casting.
>>4. LIKE operator in String / alias for substring function
>>5. order by w/o limit clause.
>>6. OR operator is supported in join condition
>>7. Syntax: exist/ not exist (errors) .rank() over (partition by)
>>/ view (unsupported)
>>
>>
>> Best,
>> Kai
>>
>


Re: [ANNOUNCEMENT] New committers, May 2018 edition!

2018-05-31 Thread Anton Kedin
Congrats!

On Thu, May 31, 2018 at 7:29 PM Kenneth Knowles  wrote:

> Huzzah!
>
> On Thu, May 31, 2018 at 7:27 PM Ahmet Altay  wrote:
>
>> Congratulations to all of you!
>>
>> On Thu, May 31, 2018 at 7:26 PM, Chamikara Jayalath > > wrote:
>>
>>> Congrats to all three!!
>>>
>>> On Thu, May 31, 2018 at 7:09 PM Davor Bonaci  wrote:
>>>
 Please join me and the rest of Beam PMC in welcoming the following
 contributors as our newest committers. They have significantly contributed
 to the project in different ways, and we look forward to many more
 contributions in the future.

 * Griselda Cuevas
 * Pablo Estrada
 * Jason Kuster

 (Apologizes for a delayed announcement, and the lack of the usual
 paragraph summarizing individual contributions.)

 Congratulations to all three! Welcome!

>>>
>>


Re: Java code under main depends on junit?

2018-05-17 Thread Anton Kedin
Opened PR  to fix the current
build issue, opened BEAM-4358
 to extract test
dependencies.

Should we keep Maven precommits running for now if we have to fix issues
like these? In the PR I had to fix another issue in the same project, and I
suspect other projects are broken for me for similar reasons.

Regards,
Anton

On Thu, May 17, 2018 at 4:52 PM Kenneth Knowles  wrote:

> I know what you mean. But indeed, test artifacts are unsuitable to depend
> on since transitive deps don't work correctly. I think it makes sense to
> have a separate test utility. For the core, one reason we didn't was to
> have PAssert available in main. But now that we have Gradle we actually can
> do that because it is not a true cycle but a false cycle introduced by
> maven.
>
> For GCP it is even easier.
>
> Kenn
>
>
> On Thu, May 17, 2018, 16:28 Thomas Weise  wrote:
>
>> It is possible to depend on a test artifact to achieve the same, but
>> unfortunately not transitively.
>>
>> Mixing test utilities into the main artifacts seems undesirable, since
>> they are only needed for tests. It may give more food to the shading
>> monster also..
>>
>> So it is probably better to create a dedicated test tools artifact that
>> qualifies as transitive dependency?
>>
>> Thanks
>>
>>
>> On Thu, May 17, 2018 at 4:17 PM, Kenneth Knowles  wrote:
>>
>>> This seems correct. Test jars are for tests. Utilities to be used for
>>> tests need to be in main jars. (If for no other reason, this is how
>>> transitive deps work)
>>>
>>> We've considered putting these things in a separate package (still in
>>> main). Just no one has done it.
>>>
>>> Kenn
>>>
>>> On Thu, May 17, 2018, 16:04 Thomas Weise  wrote:
>>>
 Hi,

 Is the following dependency intended or an oversight?


 https://github.com/apache/beam/blob/06c70bdf871c5da8a115011b43f8072916cd79e8/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/TestPubsub.java#L32

 It appears that dependent code is in test scope.

 Should the build flag this (the maven build fails)?

 Thanks


>>


Re: Java code under main depends on junit?

2018-05-17 Thread Anton Kedin
My fault, I'll fix the maven issue.

I added this file and it is not in test intentionally. The purpose of this
class is similar to TestPipeline, in that other packages which depend on
GCP IO can use this class in tests, including integration tests. For
example, right now the Beam SQL project depends on the GCP IO project and
uses both TestPipeline and TestPubsub in its integration tests. Is there a
better approach for such a use case?
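
For context, the usage pattern in a downstream integration test looks roughly
like this (a sketch only; class and method names are illustrative, and the
TestPubsub factory shown in the comment is an assumption, so check the class
for its exact surface):

// Assumes the usual org.junit.{Rule, Test} and
// org.apache.beam.sdk.testing.TestPipeline imports.
public class PubsubJsonIT {
  // TestPipeline ships in the main jar of the core SDK for exactly this reason:
  @Rule public final transient TestPipeline pipeline = TestPipeline.create();

  // TestPubsub is used the same way, as a JUnit rule coming from the main jar
  // of the GCP IO module, e.g.:
  // @Rule public final transient TestPubsub pubsub = TestPubsub.create();

  @Test
  public void testReadsMessages() {
    // ... build the pipeline against the test topic, run it, assert on results ...
  }
}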

Regards,
Anton

On Thu, May 17, 2018 at 4:04 PM Thomas Weise  wrote:

> Hi,
>
> Is the following dependency intended or an oversight?
>
>
> https://github.com/apache/beam/blob/06c70bdf871c5da8a115011b43f8072916cd79e8/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/TestPubsub.java#L32
>
> It appears that dependent code is in test scope.
>
> Should the build flag this (the maven build fails)?
>
> Thanks
>
>


Re: JDBC support for Beam SQL

2018-05-16 Thread Anton Kedin
Among these options I would lean towards option 1. We already maintain a lot
of infrastructure to call into Calcite for the non-JDBC path, so adding some
code to generate the config does not seem like a big deal, especially if it
becomes a supported approach in Calcite at some point.

Pulling the implementation RelNode out of JDBC seems to bring a lot more
unknowns:
 - it feels like it goes against the JDBC approach, since we're basically
going around JDBC result sets;
 - we would expose two ways to extract results, with different schemas,
types, etc.

I think there is also a third option: implement the JDBC driver ourselves
without using the Calcite JDBC infrastructure. This way we have a single
path into Calcite and control everything. I don't know how much effort it
would take to implement a functional JDBC driver covering our use cases, but
I think it's on a similar order of magnitude, since we don't have to
implement a lot of the API in the beginning (e.g. transactions, cursors,
DML).
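
For illustration, the plumbing for option 2 would look roughly like the
following sketch. The connection unwrap is standard Calcite JDBC usage; the
statement unwrap target in the comment is hypothetical, since no such Beam
type exists yet:

// Assumes java.sql.{Connection, DriverManager, PreparedStatement} and
// org.apache.calcite.jdbc.CalciteConnection.
Connection connection = DriverManager.getConnection("jdbc:calcite:");
CalciteConnection calciteConnection = connection.unwrap(CalciteConnection.class);

// Option 2 prepares statements through the JDBC surface...
PreparedStatement statement = calciteConnection.prepareStatement("VALUES 1");

// ...and then unwraps the statement to a Beam-specific type exposing the
// planned BeamRelNode, e.g. something like (hypothetical names):
//   BeamRelNode rel = statement.unwrap(BeamPreparedStatement.class).getBeamRelNode();

statement.close();
connection.close();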


On Wed, May 16, 2018 at 10:15 AM Kenneth Knowles  wrote:

> IIUC in #2 Beam SQL would live on the other side of a JDBC boundary from
> any use of it (including the BeamSQL transform). I'm a bit worried we'll
> have a problem plumbing all the info we need, either now or later,
> especially if we make funky extensions to support our version of SQL.
>
> Kenn
>
> On Wed, May 16, 2018 at 10:08 AM Andrew Pilloud 
> wrote:
>
>> I'm currently adding JDBC support to Beam SQL! Unfortunately Calcite has
>> two distinct entry points, one for JDBC and one for everything else (see
>> CALCITE-1525). Eventually that will change, but I'd like to avoid having
>> two versions of Beam SQL until Calcite converges on a single path for
>> parsing SQL. Here are the options I am looking at:
>>
>> 1. Make JDBC the source of truth for Calcite config and state. Generate a
>> FrameworkConfig based on the JDBC connection and continue to use the
>> non-JDBC interface to Calcite. This option comes with the risk that the two
>> paths into Calcite will diverge (as there is a bunch of code copied from
>> Calcite to generate the config), but is the easiest to implement and
>> understand.
>>
>> 2. Make JDBC the only path into Calcite. Use prepareStatement and unwrap
>> to extract a BeamRelNode out of the JDBC interface. This eliminates a
>> significant amount of code in Beam, but the unwrap path is a little
>> convoluted.
>>
>> Both options leave the user facing non-JDBC interface to Beam SQL
>> unchanged, these changes are internal.
>>
>> Andrew
>>
>


Eventual PAssert

2018-05-14 Thread Anton Kedin
Hi,

While working on an integration test
 for Pubsub-related functionality
I couldn't find a good solution to test the pipelines that don't reliably
stop.

I propose we extend PAssert to support eventual verification. In this mode a
success/failure predicate is continuously evaluated against the elements of
the pipeline until it is met, at which point the result is communicated back
to the main program/test.

Example API:

PAssert
  .thatEventually(pcollection)
  .containsInAnyOrder(e1, e2, e3)
  .synchronizingOver(signalOverPubsub())
  .timeoutAfter(10 min);

Details doc


Comments, thoughts, things that I missed?

Regards,
Anton


Re: Pubsub to Beam SQL

2018-05-10 Thread Anton Kedin
Shared the doc.
There is already a table provider for Kafka with CSV records. The
implementation at the moment doesn't touch the IO itself, just wraps it.
Supporting JSON records for Kafka can be as easy as wrapping KafkaIO with
JsonToRow
<https://github.com/apache/beam/blob/9c2b43227e1ddac39676f6c09aca1af82a9d4cdb/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/JsonToRow.java>
on top or implementing another SQL-specific transform similar to this
<https://github.com/apache/beam/pull/5253/files#diff-45ffe75359c57e7958a1d508c8a3657b>
.
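
For illustration, a minimal sketch of that wrapping (assuming an existing
Pipeline named pipeline, a Schema named jsonSchema describing the JSON
fields, and string-valued Kafka records; broker and topic names are
placeholders):

PCollection<Row> rows =
    pipeline
        .apply(
            KafkaIO.<String, String>read()
                .withBootstrapServers("broker:9092")          // placeholder
                .withTopic("events")                          // placeholder
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withoutMetadata())
        .apply(Values.create())                 // drop keys, keep the JSON payloads
        .apply(JsonToRow.withSchema(jsonSchema));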

On Thu, May 10, 2018 at 1:39 PM Ismaël Mejía <ieme...@gmail.com> wrote:

> Hi, Jumping a bit late to this discussion. This sounds super nice. But I
> could not access the document.
> How hard would it be to do this for other 'unbounded' sources, e.g. Kafka ?
> On Sat, May 5, 2018 at 2:56 AM Andrew Pilloud <apill...@google.com> wrote:
>
> > I don't think we should jump to adding a extension, but TBLPROPERTIES is
> already a DDL extension and it isn't user friendly. We should strive for a
> world where no one needs to use it. SQL needs the timestamp to be exposed
> as a column, we can't hide it without changing the definition of GROUP BY.
> I like Anton's proposal of adding it as an annotation in the column
> definition. That seems even simpler and more user friendly. We might even
> be able to get away with using the PRIMARY KEY keyword.
>
> > Andrew
>
> > On Fri, May 4, 2018 at 12:11 PM Anton Kedin <ke...@google.com> wrote:
>
> >> There are few aspects of the event timestamp definition in SQL, which we
> are talking about here:
>
> >> configuring the source. E.g. for PubsubIO you can choose whether to
> extract event timestamp from the message attributes or the message publish
> time:
>
> >> this is source-specific and cannot be part of the common DDL;
> >> TBLPROPERTIES, on the other hand, is an opaque json blob which exists
> specifically for source configuration;
> >> as Kenn is saying, some sources might not even have such configuration;
> >> at processing time, event timestamp is available in
> ProcessContext.timestamp() regardless of the specifics of the source
> configuration, so it can be extracted the same way for all sources, as
> Raghu said;
>
> >> designating one of the table columns as an event timestamp:
>
> >> query needs to be able to reference the event timestamp so we have to
> declare which column to populate with the event timestamp;
> >> this is common for all sources and we can create a special syntax, e.g.
> "columnName EVENT_TIMESTAMP". It must not contain source-specific
> configuration at this point, in my opinion;
> >> when SQL knows which column is supposed to be the timestamp, then it can
> get it from the ProcessContext.timestamp() and put it into the designated
> field the same way regardless of the source configuration;
>
> >> pubsub-specific message formatting:
>
> >> on top of the above we want to be able to expose pubsub message
> attributes, payload, and timestamp to the user queries, and do it without
> magic or user schema modifications. To do this we can enforce some
> pubsub-specific schema limitations, e.g. by exposing attributes and
> timestamp fields at a top-level schema, with payload going into the second
> level in its own field;
> >> this aspect is not fully implementable until we have support for complex
> types. Until then we cannot map full JSON to the payload field;
>
> >> I will update the doc and the implementation to reflect these comments
> where possible.
>
> >> Thank you,
> >> Anton
>
>
> >> On Fri, May 4, 2018 at 9:48 AM Raghu Angadi <rang...@google.com> wrote:
>
> >>> On Thu, May 3, 2018 at 12:47 PM Anton Kedin <ke...@google.com> wrote:
>
> >>>> I think it makes sense for the case when timestamp is provided in the
> payload (including pubsub message attributes).  We can mark the field as an
> event timestamp. But if the timestamp is internally defined by the source
> (pubsub message publish time) and not exposed in the event body, then we
> need a source-specific mechanism to extract and map the event timestamp to
> the schema. This is, of course, if we don't automatically add a magic
> timestamp field which Beam SQL can populate behind the scenes and add to
> the schema. I want to avoid this magic path for now.
>
>
> >>> Commented on the PR. As Kenn mentioned, every element in Beam has an
> event timestamp, there is no requirement to extract the timestamp by the
> SQL transform. Using the element timestamp takes care of Pubsub publish
> timestamp as well (in fact, this is the def

Re: Pubsub to Beam SQL

2018-05-04 Thread Anton Kedin
There are a few aspects of the event timestamp definition in SQL that we are
talking about here (a small sketch of the resulting DDL follows the list):

   - configuring the source. E.g. for PubsubIO you can choose whether to
   extract event timestamp from the message attributes or the message publish
   time:
   - this is source-specific and cannot be part of the common DDL;
  - TBLPROPERTIES, on the other hand, is an opaque json blob which
  exists specifically for source configuration;
  - as Kenn is saying, some sources might not even have such
  configuration;
  - at processing time, event timestamp is available in
  ProcessContext.timestamp() regardless of the specifics of the source
  configuration, so it can be extracted the same way for all sources, as
  Raghu said;
   - designating one of the table columns as an event timestamp:
  - query needs to be able to reference the event timestamp so we have
  to declare which column to populate with the event timestamp;
  - this is common for all sources and we can create a special syntax,
  e.g. "columnName EVENT_TIMESTAMP". It must not contain source-specific
  configuration at this point, in my opinion;
  - when SQL knows which column is supposed to be the timestamp, then
  it can get it from the ProcessContext.timestamp() and put it into the
  designated field the same way regardless of the source configuration;
  - pubsub-specific message formatting:
  - on top of the above we want to be able to expose pubsub message
  attributes, payload, and timestamp to the user queries, and do it without
  magic or user schema modifications. To do this we can enforce some
  pubsub-specific schema limitations, e.g. by exposing attributes and
  timestamp fields at a top-level schema, with payload going into
the second
  level in its own field;
  - this aspect is not fully implementable until we have support for
  complex types. Until then we cannot map full JSON to the payload field;
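
To make the second point concrete, a sketch of what the column designation
could look like (the EVENT_TIMESTAMP annotation and the TBLPROPERTIES key
are illustrative, not an agreed syntax):

CREATE TABLE pubsub_events (
    event_timestamp TIMESTAMP EVENT_TIMESTAMP,  -- designated event timestamp column
    name VARCHAR,
    size INTEGER
)
TYPE 'pubsub'
LOCATION 'projects/my-project/topics/my-topic'
TBLPROPERTIES '{ "timestampAttributeKey": "ts" }'  -- source-specific config stays here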

I will update the doc and the implementation to reflect these comments
where possible.

Thank you,
Anton


On Fri, May 4, 2018 at 9:48 AM Raghu Angadi <rang...@google.com> wrote:

> On Thu, May 3, 2018 at 12:47 PM Anton Kedin <ke...@google.com> wrote:
>
>> I think it makes sense for the case when timestamp is provided in the
>> payload (including pubsub message attributes).  We can mark the field as an
>> event timestamp. But if the timestamp is internally defined by the source
>> (pubsub message publish time) and not exposed in the event body, then we
>> need a source-specific mechanism to extract and map the event timestamp to
>> the schema. This is, of course, if we don't automatically add a magic
>> timestamp field which Beam SQL can populate behind the scenes and add to
>> the schema. I want to avoid this magic path for now.
>>
>
> Commented on the PR. As Kenn mentioned, every element in Beam has an event
> timestamp, there is no requirement to extract the timestamp by the SQL
> transform. Using the element timestamp takes care of Pubsub publish
> timestamp as well (in fact, this is the default when timestamp attribute is
> not specified in PubsubIO).
>
> How timestamps are customized is specific to each source. That way custom
> timestamp options seem like they belong in TBLPROPERTIES. E.g. for KafkaIO,
> it could specify "logAppendTime", "createTime", or "processingTime" etc
> (though I am not sure how user can provide their own custom extractor in
> Beam SQL, may be it could support a timestamp field in json records).
>
> Raghu.
>
>>
>> On Thu, May 3, 2018 at 11:10 AM Andrew Pilloud <apill...@google.com>
>> wrote:
>>
>>> This sounds awesome!
>>>
>>> Is event timestamp something that we need to specify for every source?
>>> If so, I would suggest we add this as a first class option on CREATE TABLE
>>> rather then something hidden in TBLPROPERTIES.
>>>
>>> Andrew
>>>
>>> On Wed, May 2, 2018 at 10:30 AM Anton Kedin <ke...@google.com> wrote:
>>>
>>>> Hi
>>>>
>>>> I am working on adding functionality to support querying Pubsub
>>>> messages directly from Beam SQL.
>>>>
>>>> *Goal*
>>>>   Provide Beam users a pure  SQL solution to create the pipelines with
>>>> Pubsub as a data source, without the need to set up the pipelines in
>>>> Java before applying the query.
>>>>
>>>> *High level approach*
>>>>
>>>>-
>>>>- Build on top of PubsubIO;
>>>>- Pubsub source will be declared using CREATE TABLE DDL statement:
>>>>   - Beam SQL already supports declaring 

Complex Types Support for Beam SQL DDL

2018-05-04 Thread Anton Kedin
Hi,

I am working on adding support for non-primitive types in Beam SQL DDL.

*Goal*
Allow users to define tables with Rows, Arrays, Maps as field types in DDL.
This enables defining schemas for complex sources, e.g. describing JSON
sources or other sources which support complex field types (BQ, etc).

*Solution*
Extend the parser we have in Beam SQL to accept the following DDL statement:
"CREATE TABLE tableName (field_name <field_type>)", where "<field_type>" can
be any of the following (a combined example follows the list):

   - "primitiveType ARRAY", for example, "field_int_arr" INTEGER ARRAY".
   Thoughts:
   - this is how SQL standard defines ARRAY field declaration;
  - existing parser supports similar syntax for collections;
  - hard to read for nested collections;
  - similar syntax is supported in Postgres
  ;
   - "ARRAY", for example "field_matrix ARRAY".
   Thoughts:
   - easy to read and support arbitrary nesting;
  - similar syntax is implemented in:
 - BigQuery
 

 ;
 - Spanner
 
 ;
 - KSQL
 

 ;
 - Spark/Hive
 

 ;
  - "MAP", for example "MAP>". Thoughts:
   - there doesn't seem to be a SQL standard support for maps;
  - looks similar to the "ARRAY" definition;
  - similar syntax is implemented in:
 - KSQL
 

 ;
 - Spark/Hive
 

 ;
  - "ROW(fieldList)", for example "row_field ROW(f_int INTEGER, f_str
   VARCHAR)". Thoughts:
   - SQL standard defines the syntax this way;
  - don't know where similar syntax is implemented;
   - "ROW", for example "row_field ROW". Thoughts:
   - ROW is not supported in a lot of dialects, but STRUCT is similar and
  supported in few dialects;
  - similar syntax for STRUCT is implemented in:
 - BigQuery
 ;
 - Spark/Hive
 

 ;
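
Putting the options above together, a combined example of the kind of
statement the extended parser would accept (field names are illustrative; the
angle-bracket spellings are the ones proposed above):

CREATE TABLE my_table (
    field_int_arr INTEGER ARRAY,
    field_matrix ARRAY<ARRAY<INTEGER>>,
    field_map MAP<VARCHAR, ARRAY<INTEGER>>,
    field_row ROW(f_int INTEGER, f_str VARCHAR)
)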

Questions/comments?
Pull Request 

Thank you,
Anton


Re: Pubsub to Beam SQL

2018-05-03 Thread Anton Kedin
A SQL-specific wrapper+custom transforms for PubsubIO should suffice. We
will probably need a way to expose the message publish timestamp if we
want to use it as an event timestamp, but that will be consumed by the same
wrapper/transform without adding anything schema or SQL-specific to
PubsubIO itself.

On Thu, May 3, 2018 at 11:44 AM Reuven Lax <re...@google.com> wrote:

> Are you planning on integrating this directly into PubSubIO, or add a
> follow-on transform?
>
> On Wed, May 2, 2018 at 10:30 AM Anton Kedin <ke...@google.com> wrote:
>
>> Hi
>>
>> I am working on adding functionality to support querying Pubsub messages
>> directly from Beam SQL.
>>
>> *Goal*
>>   Provide Beam users a pure  SQL solution to create the pipelines with
>> Pubsub as a data source, without the need to set up the pipelines in
>> Java before applying the query.
>>
>> *High level approach*
>>
>>-
>>- Build on top of PubsubIO;
>>- Pubsub source will be declared using CREATE TABLE DDL statement:
>>   - Beam SQL already supports declaring sources like Kafka and Text
>>   using CREATE TABLE DDL;
>>   - it supports additional configuration using TBLPROPERTIES clause.
>>   Currently it takes a text blob, where we can put a JSON configuration;
>>   - wrapping PubsubIO into a similar source looks feasible;
>>- The plan is to initially support messages only with JSON payload:
>>-
>>   - more payload formats can be added later;
>>- Messages will be fully described in the CREATE TABLE statements:
>>   - event timestamps. Source of the timestamp is configurable. It is
>>   required by Beam SQL to have an explicit timestamp column for windowing
>>   support;
>>   - messages attributes map;
>>   - JSON payload schema;
>>- Event timestamps will be taken either from publish time or
>>user-specified message attribute (configurable);
>>
>> Thoughts, ideas, comments?
>>
>> More details are in the doc here:
>> https://docs.google.com/document/d/1wIXTxh-nQ3u694XbF0iEZX_7-b3yi4ad0ML2pcAxYfE
>>
>>
>> Thank you,
>> Anton
>>
>


Re: Pubsub to Beam SQL

2018-05-03 Thread Anton Kedin
I think it makes sense for the case when timestamp is provided in the
payload (including pubsub message attributes).  We can mark the field as an
event timestamp. But if the timestamp is internally defined by the source
(pubsub message publish time) and not exposed in the event body, then we
need a source-specific mechanism to extract and map the event timestamp to
the schema. This is, of course, if we don't automatically add a magic
timestamp field which Beam SQL can populate behind the scenes and add to
the schema. I want to avoid this magic path for now.

On Thu, May 3, 2018 at 11:10 AM Andrew Pilloud <apill...@google.com> wrote:

> This sounds awesome!
>
> Is event timestamp something that we need to specify for every source? If
> so, I would suggest we add this as a first class option on CREATE TABLE
> rather then something hidden in TBLPROPERTIES.
>
> Andrew
>
> On Wed, May 2, 2018 at 10:30 AM Anton Kedin <ke...@google.com> wrote:
>
>> Hi
>>
>> I am working on adding functionality to support querying Pubsub messages
>> directly from Beam SQL.
>>
>> *Goal*
>>   Provide Beam users a pure  SQL solution to create the pipelines with
>> Pubsub as a data source, without the need to set up the pipelines in
>> Java before applying the query.
>>
>> *High level approach*
>>
>>-
>>- Build on top of PubsubIO;
>>- Pubsub source will be declared using CREATE TABLE DDL statement:
>>   - Beam SQL already supports declaring sources like Kafka and Text
>>   using CREATE TABLE DDL;
>>   - it supports additional configuration using TBLPROPERTIES clause.
>>   Currently it takes a text blob, where we can put a JSON configuration;
>>   - wrapping PubsubIO into a similar source looks feasible;
>>- The plan is to initially support messages only with JSON payload:
>>-
>>   - more payload formats can be added later;
>>- Messages will be fully described in the CREATE TABLE statements:
>>   - event timestamps. Source of the timestamp is configurable. It is
>>   required by Beam SQL to have an explicit timestamp column for windowing
>>   support;
>>   - messages attributes map;
>>   - JSON payload schema;
>>- Event timestamps will be taken either from publish time or
>>user-specified message attribute (configurable);
>>
>> Thoughts, ideas, comments?
>>
>> More details are in the doc here:
>> https://docs.google.com/document/d/1wIXTxh-nQ3u694XbF0iEZX_7-b3yi4ad0ML2pcAxYfE
>>
>>
>> Thank you,
>> Anton
>>
>


Pubsub to Beam SQL

2018-05-02 Thread Anton Kedin
Hi

I am working on adding functionality to support querying Pubsub messages
directly from Beam SQL.

*Goal*
  Provide Beam users a pure SQL solution to create pipelines with Pubsub as a
data source, without the need to set up the pipeline in Java before applying
the query.

*High level approach* (a sketch of the proposed DDL follows this list)

   - Build on top of PubsubIO;
   - Pubsub source will be declared using CREATE TABLE DDL statement:
  - Beam SQL already supports declaring sources like Kafka and Text
  using CREATE TABLE DDL;
  - it supports additional configuration using TBLPROPERTIES clause.
  Currently it takes a text blob, where we can put a JSON configuration;
  - wrapping PubsubIO into a similar source looks feasible;
   - The plan is to initially support messages only with JSON payload:
  - more payload formats can be added later;
   - Messages will be fully described in the CREATE TABLE statements:
  - event timestamps. Source of the timestamp is configurable. It is
  required by Beam SQL to have an explicit timestamp column for windowing
  support;
   - message attributes map;
  - JSON payload schema;
   - Event timestamps will be taken either from publish time or
   user-specified message attribute (configurable);
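
To make this concrete, a hedged sketch of what such a declaration and a query
over it might look like (the TYPE/LOCATION/TBLPROPERTIES spellings and the
property keys are illustrative and may change):

CREATE TABLE pubsub_messages (
    event_timestamp TIMESTAMP,
    attributes MAP<VARCHAR, VARCHAR>,
    payload ROW<name VARCHAR, size INTEGER>
)
TYPE 'pubsub'
LOCATION 'projects/my-project/topics/my-topic'
TBLPROPERTIES '{ "timestampAttributeKey": "ts" }'

SELECT payload.name, COUNT(*)
FROM pubsub_messages
GROUP BY payload.name, TUMBLE(event_timestamp, INTERVAL '1' MINUTE)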

Thoughts, ideas, comments?

More details are in the doc here:
https://docs.google.com/document/d/1wIXTxh-nQ3u694XbF0iEZX_7-b3yi4ad0ML2pcAxYfE


Thank you,
Anton


Re: Beam SQL Improvements

2018-04-27 Thread Anton Kedin
Romain,

I don't believe the JSON approach was investigated very thoroughly. I
mentioned a few reasons why, in my opinion, it is not the best choice, but I
may be wrong. Can you put together a design doc or a prototype?

Thank you,
Anton


On Thu, Apr 26, 2018 at 10:17 PM Romain Manni-Bucau <rmannibu...@gmail.com>
wrote:

>
>
> Le 26 avr. 2018 23:13, "Anton Kedin" <ke...@google.com> a écrit :
>
> BeamRecord (Row) has very little in common with JsonObject (I assume
> you're talking about javax.json), except maybe some similarities of the
> API. Few reasons why JsonObject doesn't work:
>
>- it is a Java EE API:
>   - Beam SDK is not limited to Java. There are probably similar APIs
>   for other languages but they might not necessarily carry the same 
> semantics
>   / APIs;
>
>
> Not a big deal I think. At least not a technical blocker.
>
>
>- It can change between Java versions;
>
> No, this is javaee ;).
>
>
>
>- Current Beam java implementation is an experimental feature to
>   identify what's needed from such API, in the end we might end up with
>   something similar to JsonObject API, but likely not
>
>
> I dont get that point as a blocker
>
>
>- ;
>   - represents JSON, which is not an API but an object notation:
>   - it is defined as unicode string in a certain format. If you
>   choose to adhere to ECMA-404, then it doesn't sound like JsonObject can
>   represent an Avro object, if I'm reading it right;
>
>
> It is in the generator impl, you can impl an avrogenerator.
>
>
>- doesn't define a type system (JSON does, but it's lacking):
>   - for example, JSON doesn't define semantics for numbers;
>   - doesn't define date/time types;
>   - doesn't allow extending JSON type system at all;
>
>
> That is why you need a metada object, or simpler, a schema with that data.
> Json or beam record doesnt help here and you end up on the same outcome if
> you think about it.
>
>
>- lacks schemas;
>
> Jsonschema are standard, widely spread and tooled compared to alternative.
>
> You can definitely try loosen the requirements and define everything in
> JSON in userland, but the point of Row/Schema is to avoid it and define
> everything in Beam model, which can be extended, mapped to JSON, Avro,
> BigQuery Schemas, custom binary format etc., with same semantics across
> beam SDKs.
>
>
> This is what jsonp would allow with the benefit of a natural pojo support
> through jsonb.
>
>
>
> On Thu, Apr 26, 2018 at 12:28 PM Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
>> Just to let it be clear and let me understand: how is BeamRecord
>> different from a JsonObject which is an API without implementation (not
>> event a json one OOTB)? Advantage of json *api* are indeed natural mapping
>> (jsonb is based on jsonp so no new binding to reinvent) and simple
>> serialization (json+gzip for ex, or avro if you want to be geeky).
>>
>> I fail to see the point to rebuild an ecosystem ATM.
>>
>> Le 26 avr. 2018 19:12, "Reuven Lax" <re...@google.com> a écrit :
>>
>>> Exactly what JB said. We will write a generic conversion from Avro (or
>>> json) to Beam schemas, which will make them work transparently with SQL.
>>> The plan is also to migrate Anton's work so that POJOs works generically
>>> for any schema.
>>>
>>> Reuven
>>>
>>> On Thu, Apr 26, 2018 at 1:17 AM Jean-Baptiste Onofré <j...@nanthrax.net>
>>> wrote:
>>>
>>>> For now we have a generic schema interface. Json-b can be an impl, avro
>>>> could be another one.
>>>>
>>>> Regards
>>>> JB
>>>> Le 26 avr. 2018, à 12:08, Romain Manni-Bucau <rmannibu...@gmail.com> a
>>>> écrit:
>>>>>
>>>>> Hmm,
>>>>>
>>>>> avro has still the pitfalls to have an uncontrolled stack which brings
>>>>> way too much dependencies to be part of any API,
>>>>> this is why I proposed a JSON-P based API (JsonObject) with a custom
>>>>> beam entry for some metadata (headers "à la Camel").
>>>>>
>>>>>
>>>>> Romain Manni-Bucau
>>>>> @rmannibucau <https://twitter.com/rmannibucau> |   Blog
>>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>>> <http://rmannibucau.wordpress.com> |  Github
>>>>> <https://github.com/rmannibucau> | LinkedIn
>>>>> <https://www.link

Re: Beam SQL Improvements

2018-04-26 Thread Anton Kedin
BeamRecord (Row) has very little in common with JsonObject (I assume you're
talking about javax.json), except maybe some similarities in the API. A few
reasons why JsonObject doesn't work:

   - it is a Java EE API:
  - Beam SDK is not limited to Java. There are probably similar APIs
  for other languages but they might not necessarily carry the
same semantics
  / APIs;
  - It can change between Java versions;
  - Current Beam java implementation is an experimental feature to
  identify what's needed from such API, in the end we might end up with
  something similar to JsonObject API, but likely not;
  - represents JSON, which is not an API but an object notation:
  - it is defined as unicode string in a certain format. If you choose
  to adhere to ECMA-404, then it doesn't sound like JsonObject can
represent
  an Avro object, if I'm reading it right;
   - doesn't define a type system (JSON does, but it's lacking):
  - for example, JSON doesn't define semantics for numbers;
  - doesn't define date/time types;
  - doesn't allow extending JSON type system at all;
   - lacks schemas;

You can definitely try to loosen the requirements and define everything in
JSON in userland, but the point of Row/Schema is to avoid that and define
everything in the Beam model, which can be extended and mapped to JSON, Avro,
BigQuery schemas, custom binary formats, etc., with the same semantics across
Beam SDKs.
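
As a small illustration of the type-system point, a DATETIME field in a Beam
Schema carries well-defined semantics that plain JSON does not give you
(method names as in the current experimental Schema/Row builders, which may
still change):

Schema schema =
    Schema.builder()
        .addStringField("name")
        .addDateTimeField("event_time")
        .build();

Row row =
    Row.withSchema(schema)
        .addValues("click", new org.joda.time.Instant(0L))
        .build();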


On Thu, Apr 26, 2018 at 12:28 PM Romain Manni-Bucau <rmannibu...@gmail.com>
wrote:

> Just to let it be clear and let me understand: how is BeamRecord different
> from a JsonObject which is an API without implementation (not event a json
> one OOTB)? Advantage of json *api* are indeed natural mapping (jsonb is
> based on jsonp so no new binding to reinvent) and simple serialization
> (json+gzip for ex, or avro if you want to be geeky).
>
> I fail to see the point to rebuild an ecosystem ATM.
>
> Le 26 avr. 2018 19:12, "Reuven Lax" <re...@google.com> a écrit :
>
>> Exactly what JB said. We will write a generic conversion from Avro (or
>> json) to Beam schemas, which will make them work transparently with SQL.
>> The plan is also to migrate Anton's work so that POJOs works generically
>> for any schema.
>>
>> Reuven
>>
>> On Thu, Apr 26, 2018 at 1:17 AM Jean-Baptiste Onofré <j...@nanthrax.net>
>> wrote:
>>
>>> For now we have a generic schema interface. Json-b can be an impl, avro
>>> could be another one.
>>>
>>> Regards
>>> JB
>>> Le 26 avr. 2018, à 12:08, Romain Manni-Bucau <rmannibu...@gmail.com> a
>>> écrit:
>>>>
>>>> Hmm,
>>>>
>>>> avro has still the pitfalls to have an uncontrolled stack which brings
>>>> way too much dependencies to be part of any API,
>>>> this is why I proposed a JSON-P based API (JsonObject) with a custom
>>>> beam entry for some metadata (headers "à la Camel").
>>>>
>>>>
>>>> Romain Manni-Bucau
>>>> @rmannibucau <https://twitter.com/rmannibucau> |   Blog
>>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>>> <http://rmannibucau.wordpress.com> |  Github
>>>> <https://github.com/rmannibucau> | LinkedIn
>>>> <https://www.linkedin.com/in/rmannibucau> | Book
>>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>>
>>>> 2018-04-26 9:59 GMT+02:00 Jean-Baptiste Onofré <j...@nanthrax.net>:
>>>>
>>>>> Hi Ismael
>>>>>
>>>>> You mean directly in Beam SQL ?
>>>>>
>>>>> That will be part of schema support: generic record could be one of
>>>>> the payload with across schema.
>>>>>
>>>>> Regards
>>>>> JB
>>>>> Le 26 avr. 2018, à 11:39, "Ismaël Mejía" < ieme...@gmail.com> a
>>>>> écrit:
>>>>>>
>>>>>> Hello Anton,
>>>>>>
>>>>>> Thanks for the descriptive email and the really useful work. Any plans
>>>>>> to tackle PCollections of GenericRecord/IndexedRecords? it seems Avro
>>>>>> is a natural fit for this approach too.
>>>>>>
>>>>>> Regards,
>>>>>> Ismaël
>>>>>>
>>>>>> On Wed, Apr 25, 2018 at 9:04 PM, Anton Kedin <ke...@google.com> wrote:
>>>>>>
>>>>>>
>>>>>>>Hi,
>>>>>>>
>>>>>>>
>>>&

Re: Beam SQL Improvements

2018-04-26 Thread Anton Kedin
Yes, that's my understanding of where the Schema work is heading. Generic
Row+Schema are in the core Java SDK and can potentially be backed by Avro,
JSON, or something else as an implementation/configuration detail. At the
moment, though, the only implementation we have relies on RowCoder.

On Thu, Apr 26, 2018 at 1:17 AM Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> For now we have a generic schema interface. Json-b can be an impl, avro
> could be another one.
>
> Regards
> JB
> Le 26 avr. 2018, à 12:08, Romain Manni-Bucau <rmannibu...@gmail.com> a
> écrit:
>>
>> Hmm,
>>
>> avro has still the pitfalls to have an uncontrolled stack which brings
>> way too much dependencies to be part of any API,
>> this is why I proposed a JSON-P based API (JsonObject) with a custom beam
>> entry for some metadata (headers "à la Camel").
>>
>>
>> Romain Manni-Bucau
>> @rmannibucau <https://twitter.com/rmannibucau> |   Blog
>> <https://rmannibucau.metawerx.net/> | Old Blog
>> <http://rmannibucau.wordpress.com> |  Github
>> <https://github.com/rmannibucau> | LinkedIn
>> <https://www.linkedin.com/in/rmannibucau> | Book
>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>
>> 2018-04-26 9:59 GMT+02:00 Jean-Baptiste Onofré <j...@nanthrax.net>:
>>
>>> Hi Ismael
>>>
>>> You mean directly in Beam SQL ?
>>>
>>> That will be part of schema support: generic record could be one of the
>>> payload with across schema.
>>>
>>> Regards
>>> JB
>>> Le 26 avr. 2018, à 11:39, "Ismaël Mejía" < ieme...@gmail.com> a écrit:
>>>>
>>>> Hello Anton,
>>>>
>>>> Thanks for the descriptive email and the really useful work. Any plans
>>>> to tackle PCollections of GenericRecord/IndexedRecords? it seems Avro
>>>> is a natural fit for this approach too.
>>>>
>>>> Regards,
>>>> Ismaël
>>>>
>>>> On Wed, Apr 25, 2018 at 9:04 PM, Anton Kedin <ke...@google.com> wrote:
>>>>
>>>>
>>>>>Hi,
>>>>>
>>>>>
>>>>>
>>>>>  I want to highlight a couple of improvements to Beam SQL we have been
>>>>>
>>>>>  working on recently which are targeted to make Beam SQL API easier to 
>>>>> use.
>>>>>
>>>>>  Specifically these features simplify conversion of Java Beans and JSON
>>>>>
>>>>>  strings to Rows.
>>>>>
>>>>>
>>>>>
>>>>>  Feel free to try this and send any bugs/comments/PRs my way.
>>>>>
>>>>>
>>>>>
>>>>>  **Caveat: this is still work in progress, and has known bugs and 
>>>>> incomplete
>>>>>
>>>>>  features, see below for details.**
>>>>>
>>>>>
>>>>>
>>>>>  Background
>>>>>
>>>>>
>>>>>
>>>>>  Beam SQL queries can only be applied to PCollection<Row>. This means that
>>>>>
>>>>>  users need to convert whatever PCollection elements they have to Rows 
>>>>> before
>>>>>
>>>>>  querying them with SQL. This usually requires manually creating a Schema 
>>>>> and
>>>>>
>>>>>  implementing a custom conversion PTransform<PCollection<Element>,
>>>>>
>>>>>  PCollection<Row>> (see Beam SQL Guide).
>>>>>
>>>>>
>>>>>
>>>>>  The improvements described here are an attempt to reduce this overhead 
>>>>> for
>>>>>
>>>>>  few common cases, as a start.
>>>>>
>>>>>
>>>>>
>>>>>  Status
>>>>>
>>>>>
>>>>>
>>>>>  Introduced a InferredRowCoder to automatically generate rows from beans.
>>>>>
>>>>>  Removes the need to manually define a Schema and Row conversion logic;
>>>>>
>>>>>  Introduced JsonToRow transform to automatically parse JSON objects to 
>>>>> Rows.
>>>>>
>>>>>  Removes the need to manually implement a conversion logic;
>>>>>
>>>>>  This is still experimental work in progre

Beam SQL Improvements

2018-04-25 Thread Anton Kedin
Hi,

I want to highlight a couple of improvements to Beam SQL we have been
working on recently that are aimed at making the Beam SQL API easier to use.
Specifically, these features simplify the conversion of Java Beans and JSON
strings to Rows.

Feel free to try this and send any bugs/comments/PRs my way.

***Caveat: this is still work in progress, and has known bugs and
incomplete features, see below for details.***

Background

Beam SQL queries can only be applied to PCollection<Row>. This means that
users need to convert whatever PCollection elements they have to Rows
before querying them with SQL. This usually requires manually creating a
Schema and implementing a custom conversion
PTransform<PCollection<Element>, PCollection<Row>> (see Beam SQL Guide).

The improvements described here are an attempt to reduce this overhead for
few common cases, as a start.

Status

   - Introduced an InferredRowCoder to automatically generate Rows from
   beans. Removes the need to manually define a Schema and Row conversion
   logic;
   - Introduced a JsonToRow transform to automatically parse JSON objects to
   Rows. Removes the need to manually implement the conversion logic;
   - This is still experimental work in progress, APIs will likely change;
   - There are known bugs/unsolved problems;


Java Beans

Introduced a coder which facilitates Row generation from Java Beans.
It reduces the overhead to:

/** Some user-defined Java Bean */
class JavaBeanObject implements Serializable {
  String getName() { ... }
}

// Obtain the objects:
PCollection<JavaBeanObject> javaBeans = ...;

// Convert to Rows and apply a SQL query:
PCollection<Row> queryResult =
    javaBeans
        .setCoder(InferredRowCoder.ofSerializable(JavaBeanObject.class))
        .apply(BeamSql.query("SELECT name FROM PCOLLECTION"));


Notice, there is no more manual Schema definition or custom conversion
logic.

*Links*

   -  example
   

   ;
   -  InferredRowCoder
   

   ;
   -  test
   

   ;


JSON

Introduced the JsonToRow transform. It is possible to query a
PCollection<String> that contains JSON objects like this:

// Assuming JSON objects look like this:
// { "type" : "foo", "size" : 333 }

// Define a Schema:
Schema jsonSchema =
    Schema
        .builder()
        .addStringField("type")
        .addInt32Field("size")
        .build();

// Obtain a PCollection of the objects in JSON format:
PCollection<String> jsonObjects = ...;

// Convert to Rows and apply a SQL query:
PCollection<Row> queryResults =
    jsonObjects
        .apply(JsonToRow.withSchema(jsonSchema))
        .apply(BeamSql.query(
            "SELECT type, AVG(size) FROM PCOLLECTION GROUP BY type"));


Notice, the JSON-to-Row conversion is done by the JsonToRow transform. It is
currently required to supply a Schema.

*Links*

   -  JsonToRow
   

   ;
   -  test/example
   

   ;


Going Forward

   - fix bugs (BEAM-4163 ,
   BEAM-4161  ...)
   - implement more features (BEAM-4167
   , more types of
   objects);
   - wire this up with sources/sinks to further simplify SQL API;


Thank you,
Anton


Re: New beam contributor experience?

2018-03-14 Thread Anton Kedin
Not sure if it was mentioned in other threads, but it probably makes sense
to add Gradle instructions there.


On Wed, Mar 14, 2018 at 11:48 AM Alan Myrvold  wrote:

> There is a contribution guide at
> https://beam.apache.org/contribute/contribution-guide/
> Has anyone had challenges / pain points when getting started with new
> contributions?
> Any suggestions for making this better?
>
> Alan
>


Re: slack @the-asf?

2018-03-14 Thread Anton Kedin
What's the plan for users without `@apache.org` email?
The page says to contact a workspace administrator for an invitation. Will
all existing users be automatically invited to the new workspace?


On Wed, Mar 14, 2018 at 9:58 AM Thomas Weise  wrote:

> After you enter the ASF ID on https://the-asf.slack.com/signup an email
> will go to your ASF address. Just completed that successfully.
>
>
> --
> sent from mobile
>
> On Wed, Mar 14, 2018, 9:37 AM Romain Manni-Bucau 
> wrote:
>
>>
>>
>> 2018-03-14 17:28 GMT+01:00 Lukasz Cwik :
>>
>>> Telling people to migrate without updating the website is not friendly
>>> to the community or providing the self enrollment link is not friendly to
>>> those that helped invite all those existing users.
>>> We should provide users on the old list a self enrollment link to the
>>> new list so that we don't have to handle the 100s of invites manually.
>>>
>>> Also, I tried joining the-asf slack using my @apache.org address and
>>> was unable to join. How does one self enroll in this new channel?
>>>
>>
>> Normally you have a big button, you enter your apache.org mail and can
>> log in directly. What did you see?
>>
>>
>>>
>>> On Wed, Mar 14, 2018 at 9:19 AM, Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>
 Can help but I have no idea how to do it yet, don't hesitate to ping me
 if you don't have much cycles.


 Romain Manni-Bucau
 @rmannibucau  |  Blog
  | Old Blog
  | Github
  | LinkedIn
  | Book
 

 2018-03-14 17:16 GMT+01:00 Jean-Baptiste Onofré :

> Do you want me to prepare the site PR with you ?
>
> Regards
> JB
> Le 14 mars 2018, à 09:12, Romain Manni-Bucau 
> a écrit:
>>
>> updated the slack message (feel free to rephrase/reformat it, the
>> length is quite limited so it is ok) and will send the message just after
>> this one on #general
>> will need help for the website update on the 26th ;)
>>
>>
>> Romain Manni-Bucau
>> @rmannibucau  |   Blog
>>  | Old Blog
>>  |  Github
>>  | LinkedIn
>>  | Book
>> 
>>
>> 2018-03-14 16:48 GMT+01:00 Reuven Lax :
>>
>>> I don't have a strong feeling here. As long as there are
>>> instructions on how to use the new slack channel, sounds good to me!
>>>
>>>
>>> On Wed, Mar 14, 2018 at 1:35 AM Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>
 Hi guys,

 What do you think to migrate to the standard asf slack? I would
 make it a bit more easy to find beam channel IMHO and it would stay
 consistent with others. It also allows to auto join for asf guys.

 If you think it is the way to go we can do:

 1. put a message on current slack channel saying "we are moving to
 the-asf #beam, this channel will be closed on the XXX"
 2. notify it on the general (current) channel
 3. update the website

 Personally I think a transition period of 10 days is enough then
 channels can be archived.

 Wdyt?

 Side note: asf is in the process to try to get history on slack as
 well which would be beneficial for beam too.

 Romain Manni-Bucau
 @rmannibucau  |   Blog
  | Old Blog
  |  Github
  | LinkedIn
  | Book
 

>>>
>>

>>>
>>

