Re: [VOTE + INPUT] Beam Mascot Designs, 3rd iteration - Deadline Wednesday, April 1

2020-04-01 Thread jincheng sun
Thanks Julian!

> 1. Do you prefer stripes or no stripes?
+1 for no stripes.

I like the design, even the early sketches. Great work!

Best,
Jincheng



Jan Lukavský  于2020年3月31日周二 下午4:33写道:

> On 3/31/20 4:06 AM, Julian Bruno wrote:
>
> Hello Apache Beam Community,
>
> We need a third round of input from the community to finish the design. Please
> share your input no later than Wednesday, April 1st, at noon Pacific Time.
> Below you will find a link to the presentation of the work process and we
> are eager to know what you think of the current design [1].
>
> Our question to you:
>
> 1. Do you prefer stripes or no stripes?
>
> +1 for no stripes. +0.5 for stripes. +1 overall, nice work!
>
>
> Please reply inline, so it is clear what exactly you are referring to. The
> vote and input phase will be open until Wednesday, April 1st, at 12 pm
> Pacific Time. We will incorporate the feedback into the final design
> iteration of the mascot.
>
> Please find below the attached source file (.SVG) and a High-Quality Image
> (.PNG).
>
> Thank you,
>
> --
> Julian Bruno // Visual Artist & Graphic Designer
>  (510) 367-0551 / SF Bay Area, CA
> www.instagram.com/julbro.art
>
> [1]
>
>  3/30 - Mascot Weekly Update
> 
>
>
>
>
>


Re: [DISCUSS] Drop support for Flink 1.7

2020-03-11 Thread jincheng sun
Hi all,
I would like to drop the Flink 1.7 support soon, as Flink 1.10 is already
supported as of this commit:
https://github.com/apache/beam/commit/f91b390c8bbab4afe14734c1266da51dcc7558c9

Best,
Jincheng



jincheng sun  于2020年2月24日周一 上午11:22写道:

> Thanks for all of your feedback. I will drop the 1.7 support after
> https://github.com/apache/beam/pull/10945 has been merged.
>
> Best,
> Jincheng
>
>
> David Morávek  于2020年2月19日周三 下午7:08写道:
>
>> +1 for dropping 1.7, once we have 1.10 support ready
>>
>> D.
>>
>> On Tue, Feb 18, 2020 at 7:01 PM  wrote:
>>
>>> Hi Ismael,
>>> yes, sure. The proposal would be to have a snapshot dependency in the
>>> feature branch. The snapshot must be changed to a release before merging to
>>> master.
>>> Jan
>>>
>>> Dne 18. 2. 2020 17:55 napsal uživatel Ismaël Mejía :
>>>
>>> Just to be sure, you mean Flink 1.11.0-SNAPSHOT ONLY in the next branch
>>> dependency?
>>> We should not have any SNAPSHOT dependency from other projects in Beam.
>>>
>>> On Tue, Feb 18, 2020 at 5:34 PM  wrote:
>>>
>>> Hi Jincheng,
>>> I think there should be a "process" for this. I would propose to:
>>>  a) create a new branch with support for the new (snapshot) Flink - currently
>>> that would mean Flink 1.11
>>>  b) as part of this branch, drop support for all versions up to N - 3
>>> I think that dropping all versions and adding the new version should be
>>> atomic, otherwise we risk releasing a Beam version with fewer than three
>>> supported Flink versions.
>>> I'd suggest starting with the 1.10 branch support and including the drop of
>>> 1.7 in that branch. Once 1.10 gets merged, we should create 1.11 with a
>>> snapshot dependency to be able to keep up with the release cadence of Flink.
>>> WDYT?
>>>  Jan
>>>
>>> Dne 18. 2. 2020 15:34 napsal uživatel jincheng sun <
>>> sunjincheng...@gmail.com>:
>>>
>>> Hi folks,
>>>
>>> Apache Flink 1.10 has completed its release announcement [1]. We would
>>> now like to add a Flink 1.10 build target and make the Flink Runner compatible
>>> with Flink 1.10 [2]. I would suggest that the Apache Beam community maintain
>>> at most three versions of the Flink runner, in line with the update policy for
>>> Apache Flink releases [3], i.e. I think it's better to maintain the three
>>> versions 1.8/1.9/1.10 after adding the Flink 1.10 build target to the Flink runner.
>>>
>>> Keeping Flink runner 1.7 around will hinder the upgrade of Flink runner
>>> 1.8.x and 1.9.x because the Flink 1.7 code is too old; more detail can be
>>> found in [4]. So, we need to drop support for Flink runner 1.7 as soon as
>>> possible.
>>>
>>> This discussion is also CC'd to @user, since the change will affect our
>>> users. And I would appreciate it if you could review the PR [5].
>>>
>>> Welcome any feedback!
>>>
>>> Best,
>>> Jincheng
>>>
>>> [1]
>>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/ANNOUNCE-Apache-Flink-1-10-0-released-td37564.html
>>> [2] https://issues.apache.org/jira/browse/BEAM-9295
>>> [3]
>>> https://flink.apache.org/downloads.html#update-policy-for-old-releases
>>> [4] https://issues.apache.org/jira/browse/BEAM-9299
>>> [5] https://github.com/apache/beam/pull/10884
>>>
>>>
>>>
>>>


Re: [ANNOUNCE] New committer: Jincheng Sun

2020-02-27 Thread jincheng sun
Thanks for all of your warm welcomes. It is really a pleasure working with
you and the community!

Valentyn Tymofieiev  于2020年2月26日周三 上午10:57写道:

> Congratulations, Jincheng!
>
> On Tue, Feb 25, 2020 at 5:02 PM Chamikara Jayalath 
> wrote:
>
>> Congrats Jincheng!
>>
>> On Tue, Feb 25, 2020 at 10:14 AM Rui Wang  wrote:
>>
>>> Congrats!
>>>
>>>
>>> -Rui
>>>
>>> On Mon, Feb 24, 2020 at 11:24 PM Austin Bennett <
>>> whatwouldausti...@gmail.com> wrote:
>>>
>>>> Congrats!
>>>>
>>>> On Mon, Feb 24, 2020, 11:22 PM Alex Van Boxel  wrote:
>>>>
>>>>> Congrats!
>>>>>
>>>>>  _/
>>>>> _/ Alex Van Boxel
>>>>>
>>>>>
>>>>> On Mon, Feb 24, 2020 at 8:13 PM Kyle Weaver 
>>>>> wrote:
>>>>>
>>>>>> Thanks Jincheng for all your work on Beam and Flink integration.
>>>>>>
>>>>>> On Mon, Feb 24, 2020 at 11:02 AM Yichi Zhang 
>>>>>> wrote:
>>>>>>
>>>>>>> Congrats, Jincheng!
>>>>>>>
>>>>>>> On Mon, Feb 24, 2020 at 9:45 AM Ahmet Altay 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Congratulations!
>>>>>>>>
>>>>>>>> On Mon, Feb 24, 2020 at 6:48 AM Thomas Weise 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Congratulations!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Feb 24, 2020 at 6:45 AM Ismaël Mejía 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Congrats Jincheng!
>>>>>>>>>>
>>>>>>>>>> On Mon, Feb 24, 2020 at 1:39 PM Gleb Kanterov 
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Congratulations!
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Feb 24, 2020 at 1:18 PM Hequn Cheng 
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Congratulations Jincheng, well deserved!
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Hequn
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Feb 24, 2020 at 7:21 PM Reza Rokni 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Congrats!
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Feb 24, 2020 at 7:15 PM Jan Lukavský 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Congrats Jincheng!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   Jan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2/24/20 11:55 AM, Maximilian Michels wrote:
>>>>>>>>>>>>>> > Hi everyone,
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Please join me and the rest of the Beam PMC in welcoming a
>>>>>>>>>>>>>> new
>>>>>>>>>>>>>> > committer: Jincheng Sun 
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Jincheng has worked on generalizing parts of Beam for
>>>>>>>>>>>>>> Flink's Python
>>>>>>>>>>>>>> > API. He has also picked up other issues, like fixing
>>>>>>>>>>>>>> documentation,
>>>>>>>>>>>>>> > implementing missing features, or cleaning up code [1].
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > In consideration of his contributions, the Beam PMC trusts
>>>>>>>>>>>>>> him with
>>>>>>>>>>>>>> > the responsibilities of a Beam committer [2].
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Thank you for your contributions Jincheng!
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > -Max, on behalf of the Apache Beam PMC
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > [1]
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> https://jira.apache.org/jira/browse/BEAM-9299?jql=project%20%3D%20BEAM%20AND%20assignee%20in%20(sunjincheng121)
>>>>>>>>>>>>>> > [2]
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>>>>>>>>>>>>>
>>>>>>>>>>>>>


Re: [ANNOUNCE] New committer: Chad Dombrova

2020-02-25 Thread jincheng sun
Congratulations Chad!

Best,
Jincheng


Jan Lukavský  于2020年2月25日周二 下午5:05写道:

> Congrats Chad!
> On 2/25/20 9:48 AM, Gleb Kanterov wrote:
>
> Congratulations!
>
> On Tue, Feb 25, 2020 at 9:44 AM Ismaël Mejía  wrote:
>
>> Congratulations, so well deserved for the lots of amazing work and new
>> perspectives you have brought into the project!!!
>>
>> On Tue, Feb 25, 2020 at 8:24 AM Austin Bennett <
>> whatwouldausti...@gmail.com> wrote:
>>
>>> Hooray!
>>>
>>> On Mon, Feb 24, 2020, 11:21 PM Alex Van Boxel  wrote:
>>>
 Congrats!

  _/
 _/ Alex Van Boxel


 On Tue, Feb 25, 2020 at 6:21 AM Thomas Weise  wrote:

> Congratulations!
>
>
> On Mon, Feb 24, 2020, 3:38 PM Ankur Goenka  wrote:
>
>> Congratulations Chad!
>>
>> On Mon, Feb 24, 2020 at 3:34 PM Ahmet Altay  wrote:
>>
>>> Congratulations!
>>>
>>> On Mon, Feb 24, 2020 at 3:25 PM Sam Bourne 
>>> wrote:
>>>
 Nice one Chad. Your typing efforts are very welcome.

 On Tue, Feb 25, 2020 at 10:16 AM Yichi Zhang 
 wrote:

> Congratulations, Chad!
>
> On Mon, Feb 24, 2020 at 3:10 PM Robert Bradshaw <
> rober...@google.com> wrote:
>
>> Well deserved, Chad. Congratulations!
>>
>> On Mon, Feb 24, 2020 at 2:43 PM Reza Rokni 
>> wrote:
>> >
>> > Congratulations! :-)
>> >
>> > On Tue, Feb 25, 2020 at 6:41 AM Chad Dombrova <
>> chad...@gmail.com> wrote:
>> >>
>> >> Thanks, folks!  I'm very excited to "retest this" :)
>> >>
>> >> Especially big thanks to Robert and Udi for all their hard
>> work reviewing my PRs.
>> >>
>> >> -chad
>> >>
>> >>
>> >> On Mon, Feb 24, 2020 at 1:44 PM Brian Hulette <
>> bhule...@google.com> wrote:
>> >>>
>> >>> Congratulations Chad! Thanks for all your contributions :)
>> >>>
>> >>> On Mon, Feb 24, 2020 at 1:43 PM Kyle Weaver <
>> kcwea...@google.com> wrote:
>> 
>>  Well-deserved, thanks for your dedication to the project
>> Chad. :)
>> 
>>  On Mon, Feb 24, 2020 at 1:34 PM Udi Meiri 
>> wrote:
>> >
>> > Congrats and welcome, Chad!
>> >
>> > On Mon, Feb 24, 2020 at 1:21 PM Pablo Estrada <
>> pabl...@google.com> wrote:
>> >>
>> >> Hi everyone,
>> >>
>> >> Please join me and the rest of the Beam PMC in welcoming a
>> new committer: Chad Dombrova
>> >>
>> >> Chad has contributed to the project in multiple ways,
>> including improvements to the testing infrastructure, and adding type
>> annotations throughout the Python SDK, as well as working closely 
>> with the
>> community on these improvements.
>> >>
>> >> In consideration of his contributions, the Beam PMC trusts
>> him with the responsibilities of a Beam Committer[1].
>> >>
>> >> Thanks Chad for your contributions!
>> >>
>> >> -Pablo, on behalf of the Apache Beam PMC.
>> >>
>> >> [1]
>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>
>


Re: [DISCUSS] Drop support for Flink 1.7

2020-02-23 Thread jincheng sun
Thanks for all of your feedback. I will drop the 1.7 support after
https://github.com/apache/beam/pull/10945 has been merged.

Best,
Jincheng


David Morávek  于2020年2月19日周三 下午7:08写道:

> +1 for dropping 1.7, once we have 1.10 support ready
>
> D.
>
> On Tue, Feb 18, 2020 at 7:01 PM  wrote:
>
>> Hi Ismael,
>> yes, sure. The proposal would be to have a snapshot dependency in the
>> feature branch. The snapshot must be changed to a release before merging to
>> master.
>> Jan
>>
>> Dne 18. 2. 2020 17:55 napsal uživatel Ismaël Mejía :
>>
>> Just to be sure, you mean Flink 1.11.0-SNAPSHOT ONLY in the next branch
>> dependency?
>> We should not have any SNAPSHOT dependency from other projects in Beam.
>>
>> On Tue, Feb 18, 2020 at 5:34 PM  wrote:
>>
>> Hi Jincheng,
>> I think there should be a "process" for this. I would propose to:
>>  a) create a new branch with support for the new (snapshot) Flink - currently
>> that would mean Flink 1.11
>>  b) as part of this branch, drop support for all versions up to N - 3
>> I think that dropping all versions and adding the new version should be
>> atomic, otherwise we risk releasing a Beam version with fewer than three
>> supported Flink versions.
>> I'd suggest starting with the 1.10 branch support and including the drop of
>> 1.7 in that branch. Once 1.10 gets merged, we should create 1.11 with a
>> snapshot dependency to be able to keep up with the release cadence of Flink.
>> WDYT?
>>  Jan
>>
>> Dne 18. 2. 2020 15:34 napsal uživatel jincheng sun <
>> sunjincheng...@gmail.com>:
>>
>> Hi folks,
>>
>> Apache Flink 1.10 has completed its release announcement [1]. We would
>> now like to add a Flink 1.10 build target and make the Flink Runner compatible
>> with Flink 1.10 [2]. I would suggest that the Apache Beam community maintain
>> at most three versions of the Flink runner, in line with the update policy for
>> Apache Flink releases [3], i.e. I think it's better to maintain the three
>> versions 1.8/1.9/1.10 after adding the Flink 1.10 build target to the Flink runner.
>>
>> Keeping Flink runner 1.7 around will hinder the upgrade of Flink runner
>> 1.8.x and 1.9.x because the Flink 1.7 code is too old; more detail can be
>> found in [4]. So, we need to drop support for Flink runner 1.7 as soon as
>> possible.
>>
>> This discussion is also CC'd to @user, since the change will affect our
>> users. And I would appreciate it if you could review the PR [5].
>>
>> Welcome any feedback!
>>
>> Best,
>> Jincheng
>>
>> [1]
>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/ANNOUNCE-Apache-Flink-1-10-0-released-td37564.html
>> [2] https://issues.apache.org/jira/browse/BEAM-9295
>> [3]
>> https://flink.apache.org/downloads.html#update-policy-for-old-releases
>> [4] https://issues.apache.org/jira/browse/BEAM-9299
>> [5] https://github.com/apache/beam/pull/10884
>>
>>
>>
>>


Re: [PROPOSAL] Vendored bytebuddy dependency release

2020-02-23 Thread jincheng sun
+1

Best,
Jincheng


Kai Jiang  于2020年2月23日周日 上午2:26写道:

> +1
>
> On Sat, Feb 22, 2020 at 09:50 Alex Van Boxel  wrote:
>
>> +1
>>
>>  _/
>>
>> _/ Alex Van Boxel
>>
>>
>> On Sat, Feb 22, 2020 at 5:45 PM Jean-Baptiste Onofré 
>> wrote:
>>
>>> +1
>>>
>>> Regards
>>> JB
>>>
>>> Le sam. 22 févr. 2020 à 15:02, Ismaël Mejía  a
>>> écrit :
>>>
 The version of bytebuddy Beam is vendoring (1.9.3) is more than 16
 months old. I
 would like to propose that we upgrade it [1] to the most recent version
 (1.10.8)
 [2] so we can benefit from the latest improvements for Java 11 and
 upgraded ASM.

 If everyone agrees I would like to volunteer as the release manager for
 this
 upgrade.

 [1] https://issues.apache.org/jira/browse/BEAM-9342
 [2] https://github.com/raphw/byte-buddy/blob/master/release-notes.md




Re: [VOTE] Vendored Dependencies Release gRPC 1.26.0 v0.2 for BEAM-9252

2020-02-22 Thread jincheng sun
+1(non-binding)

Best,
Jincheng
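
For anyone else validating: below is a minimal sketch (not an official Beam
tool) of the kind of entry-level jar comparison Luke describes in the quoted
thread. The jar file names are placeholders for locally downloaded 0.1 and
0.2 artifacts, not the exact staged names.

    # compare_vendored_jars.py -- sketch only; jar paths are assumptions
    import zipfile

    old_jar = "beam-vendor-grpc-1_26_0-0.1.jar"   # hypothetical local path
    new_jar = "beam-vendor-grpc-1_26_0-0.2.jar"   # hypothetical local path

    old_entries = set(zipfile.ZipFile(old_jar).namelist())
    new_entries = set(zipfile.ZipFile(new_jar).namelist())

    # Per the thread, module-info.class, Main.class and Main$1.class should
    # show up as removed, and nothing unexpected should show up as added.
    print("removed in 0.2:", sorted(old_entries - new_entries))
    print("added in 0.2:  ", sorted(new_entries - old_entries))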


Kai Jiang  于2020年2月22日周六 下午2:58写道:

> +1 (non-binding)
>
> On Fri, Feb 21, 2020 at 5:03 PM Robin Qiu  wrote:
>
>> +1 (verified)
>>
>> On Fri, Feb 21, 2020 at 4:55 PM Robert Bradshaw 
>> wrote:
>>
>>> +1 (binding)
>>>
>>>
>>> On Fri, Feb 21, 2020 at 4:48 PM Ahmet Altay  wrote:
>>> >
>>> > +1
>>> >
>>> > On Fri, Feb 21, 2020 at 4:39 PM Luke Cwik  wrote:
>>> >>
>>> >> +1 (binding)
>>> >> I diffed the binary contents of the 0.1 jar and 0.2 jar with no
>>> changes to the contents of the files, and can confirm that module-info.class,
>>> the offending Main.class, and Main$1.class have been removed as well.
>>> >>
>>> >> On Fri, Feb 21, 2020 at 4:38 PM Luke Cwik  wrote:
>>> >>>
>>> >>> Please review the release of the following artifacts that we vendor:
>>> >>>  * beam-vendor-grpc-1_26_0
>>> >>>
>>> >>> Hi everyone,
>>> >>> Please review and vote on the release candidate #1 for the version
>>> 0.2, as follows:
>>> >>> [ ] +1, Approve the release
>>> >>> [ ] -1, Do not approve the release (please provide specific comments)
>>> >>>
>>> >>>
>>> >>> The complete staging area is available for your review, which
>>> includes:
>>> >>> * the official Apache source release to be deployed to
>>> dist.apache.org [1], which is signed with the key with fingerprint
>>> EAD5DE293F4A03DD2E77565589E68A56E371CCA2 [2],
>>> >>> * all artifacts to be deployed to the Maven Central Repository [3],
>>> >>> * commit hash "91125d1d1fc1fe8c5684a486c9b6163c4ec41549" [4],
>>> >>>
>>> >>> The vote will be open for at least 72 hours. It is adopted by
>>> majority approval, with at least 3 PMC affirmative votes.
>>> >>>
>>> >>> Thanks,
>>> >>> Release Manager
>>> >>>
>>> >>> [1] https://dist.apache.org/repos/dist/dev/beam/vendor/
>>> >>> [2] https://dist.apache.org/repos/dist/release/beam/KEYS
>>> >>> [3]
>>> https://repository.apache.org/content/repositories/orgapachebeam-1092/
>>> >>> [4]
>>> https://github.com/apache/beam/commit/91125d1d1fc1fe8c5684a486c9b6163c4ec41549
>>>
>


Re: [ANNOUNCE] New committer: Alex Van Boxel

2020-02-18 Thread jincheng sun
Congratulations!
Best,
Jincheng


Robin Qiu 于2020年2月19日 周三05:52写道:

> Congratulations, Alex!
>
> On Tue, Feb 18, 2020 at 1:48 PM Valentyn Tymofieiev 
> wrote:
>
>> Congratulations!
>>
>> On Tue, Feb 18, 2020 at 10:38 AM Alex Van Boxel  wrote:
>>
>>> Thank you everyone!
>>>
>>>  _/
>>> _/ Alex Van Boxel
>>>
>>>
>>> On Tue, Feb 18, 2020 at 7:05 PM  wrote:
>>>
 Congrats Alex!
 Jan


 Dne 18. 2. 2020 18:46 napsal uživatel Thomas Weise :

 Congratulations!


 On Tue, Feb 18, 2020 at 8:33 AM Ismaël Mejía  wrote:

 Congrats Alex! Well done!

 On Tue, Feb 18, 2020 at 5:10 PM Gleb Kanterov  wrote:

 Congratulations!

 On Tue, Feb 18, 2020 at 5:02 PM Brian Hulette 
 wrote:

 Congratulations Alex! Well deserved!

 On Tue, Feb 18, 2020 at 7:49 AM Pablo Estrada 
 wrote:

 Hi everyone,

 Please join me and the rest of the Beam PMC in welcoming
 a new committer: Alex Van Boxel

 Alex has contributed to Beam in many ways - as an organizer for Beam
 Summit, and meetups - and also with the Protobuf extensions for schemas.

 In consideration of his contributions, the Beam PMC trusts him with the
 responsibilities of a Beam committer[1].

 Thanks for your contributions Alex!

 Pablo, on behalf of the Apache Beam PMC.

 [1]
 https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer


 --

Best,
Jincheng
-
Twitter: https://twitter.com/sunjincheng121
-


[DISCUSS] Drop support for Flink 1.7

2020-02-18 Thread jincheng sun
Hi folks,

Apache Flink 1.10 has completed its release announcement [1]. We would now
like to add a Flink 1.10 build target and make the Flink Runner compatible with
Flink 1.10 [2]. I would suggest that the Apache Beam community maintain at most
three versions of the Flink runner, in line with the update policy for Apache
Flink releases [3], i.e. I think it's better to maintain the three versions
1.8/1.9/1.10 after adding the Flink 1.10 build target to the Flink runner.

Keeping Flink runner 1.7 around will hinder the upgrade of Flink runner
1.8.x and 1.9.x because the Flink 1.7 code is too old; more detail can be
found in [4]. So, we need to drop support for Flink runner 1.7 as soon as
possible.

This discussion is also CC'd to @user, since the change will affect our users.
And I would appreciate it if you could review the PR [5].

Welcome any feedback!

Best,
Jincheng

[1]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/ANNOUNCE-Apache-Flink-1-10-0-released-td37564.html
[2] https://issues.apache.org/jira/browse/BEAM-9295
[3] https://flink.apache.org/downloads.html#update-policy-for-old-releases
[4] https://issues.apache.org/jira/browse/BEAM-9299
[5] https://github.com/apache/beam/pull/10884


Re: Sphinx Docs Command Error (:sdks:python:test-suites:tox:pycommon:docs)

2020-02-11 Thread jincheng sun
I think it's good advice to remove the "-j 8" option if it doesn't affect
the performance much.
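
As a point of reference, here is a minimal sketch (paths and values are
assumptions, not Beam's exact layout) of what dropping "-j 8" amounts to when
driving Sphinx programmatically, which is roughly what the sphinx-build
invocations in generate_pydoc.sh do:

    # sketch only: build the docs serially instead of with "-j 8"
    from sphinx.cmd.build import build_main

    source_dir = "target/docs/source"   # assumed .rst source location
    output_dir = "target/docs/_build"   # assumed output location

    # Parallel variant; the worker processes fork(), which is what trips the
    # objc fork-safety check on macOS in the traceback quoted below:
    #   build_main(["-b", "html", "-j", "8", source_dir, output_dir])

    # Serial variant: slower (roughly 40s -> 60s per Udi's measurement
    # below) but avoids fork() entirely.
    exit_code = build_main(["-b", "html", source_dir, output_dir])
    raise SystemExit(exit_code)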


Udi Meiri  于2020年2月12日周三 上午2:20写道:

> For me the difference was about 20s longer (40s -> 60s approx). Not
> significant IMO
>
> On Tue, Feb 11, 2020 at 9:59 AM Ahmet Altay  wrote:
>
>> Should we remove the "-j 8" option by default? The Sphinx docs say this is
>> an experimental option [1]. I do not recall docs generation taking a long
>> time; does it increase significantly without this option?
>>
>> [1] http://www.sphinx-doc.org/en/stable/man/sphinx-build.html
>>
>> On Tue, Feb 11, 2020 at 1:16 AM Shoaib Zafar <
>> shoaib.za...@venturedive.com> wrote:
>>
>>> Thanks, Udi and Jincheng for the response.
>>> The suggested solution worked for me as well.
>>>
>>> Regards,
>>>
>>> *Shoaib Zafar*
>>> Software Engineering Lead
>>> Mobile: +92 333 274 6242
>>> Skype: live:shoaibzafar_1
>>>
>>> <http://venturedive.com/>
>>>
>>>
>>> On Tue, Feb 11, 2020 at 1:17 PM jincheng sun 
>>> wrote:
>>>
>>>> I have verified that this issue can be reproduced in my local
>>>> environment (macOS) and that the solution suggested by Udi works!
>>>>
>>>> Best,
>>>> Jincheng
>>>>
>>>> Udi Meiri  于2020年2月11日周二 上午8:51写道:
>>>>
>>>>> I don't have those issues (running on Linux), but a possible
>>>>> workaround could be to remove the "-j 8" flags (2 locations) in
>>>>> generate_pydoc.sh.
>>>>>
>>>>>
>>>>> On Mon, Feb 10, 2020 at 11:06 AM Shoaib Zafar <
>>>>> shoaib.za...@venturedive.com> wrote:
>>>>>
>>>>>> Hello Beamers.
>>>>>>
>>>>>> Just curious, is anyone having trouble running the
>>>>>> ':sdks:python:test-suites:tox:pycommon:docs' command locally?
>>>>>>
>>>>>> After rebasing with master recently, I am facing a Sphinx thread fork
>>>>>> error on my macOS Mojave, using Python 3.7.0.
>>>>>> I tried adding the environment variable "export
>>>>>> OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES" (which I found on Google)
>>>>>> but no luck!
>>>>>>
>>>>>> Any suggestions/help?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Console Log:
>>>>>> --
>>>>>> 
>>>>>> Creating file target/docs/source/apache_beam.utils.proto_utils.rst.
>>>>>> Creating file target/docs/source/apache_beam.utils.retry.rst.
>>>>>> Creating file
>>>>>> target/docs/source/apache_beam.utils.subprocess_server.rst.
>>>>>> Creating file
>>>>>> target/docs/source/apache_beam.utils.thread_pool_executor.rst.
>>>>>> Creating file target/docs/source/apache_beam.utils.timestamp.rst.
>>>>>> Creating file target/docs/source/apache_beam.utils.urns.rst.
>>>>>> Creating file target/docs/source/apache_beam.utils.rst.
>>>>>> objc[8384]: +[__NSCFConstantString initialize] may have been in
>>>>>> progress in another thread when fork() was called.
>>>>>> objc[8384]: +[__NSCFConstantString initialize] may have been in
>>>>>> progress in another thread when fork() was called. We cannot safely call 
>>>>>> it
>>>>>> or ignore it in the fork() child process. Crashing instead. Set a
>>>>>> breakpoint on objc_initializeAfterForkError to debug.
>>>>>>
>>>>>> Traceback (most recent call last):
>>>>>>   File
>>>>>> "/Users/shoaib/Projects/beam/newbeam/sdks/python/test-suites/tox/pycommon/build/srcs/sdks/python/target/.tox-py37-docs/py37-docs/lib/python3.7/site-packages/sphinx/cmd/build.py",
>>>>>> line 304, in build_main
>>>>>> app.build(args.force_all, filenames)
>>>>>>   File
>>>>>> "/Users/shoaib/Projects/beam/newbeam/sdks/python/test-suites/tox/pycommon/build/srcs/sdks/python/target/.tox-py37-docs/py37-docs/lib/python3.7/site-packages/sphinx/application.py",
>>>>>> line 335, in build
>>>>>> self.builder.build_all()
>>>>>>   File
>>>>>> &

Re: Sphinx Docs Command Error (:sdks:python:test-suites:tox:pycommon:docs)

2020-02-11 Thread jincheng sun
I have verified that this issue can be reproduced in my local environment
(macOS) and that the solution suggested by Udi works!

Best,
Jincheng

Udi Meiri  于2020年2月11日周二 上午8:51写道:

> I don't have those issues (running on Linux), but a possible workaround
> could be to remove the "-j 8" flags (2 locations) in generate_pydoc.sh.
>
>
> On Mon, Feb 10, 2020 at 11:06 AM Shoaib Zafar <
> shoaib.za...@venturedive.com> wrote:
>
>> Hello Beamers.
>>
>> Just curious, is anyone having trouble running the
>> ':sdks:python:test-suites:tox:pycommon:docs' command locally?
>>
>> After rebasing with master recently, I am facing a Sphinx thread fork error
>> on my macOS Mojave, using Python 3.7.0.
>> I tried adding the environment variable "export
>> OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES" (which I found on Google) but
>> no luck!
>>
>> Any suggestions/help?
>>
>> Thanks!
>>
>> Console Log:
>> --
>> 
>> Creating file target/docs/source/apache_beam.utils.proto_utils.rst.
>> Creating file target/docs/source/apache_beam.utils.retry.rst.
>> Creating file target/docs/source/apache_beam.utils.subprocess_server.rst.
>> Creating file
>> target/docs/source/apache_beam.utils.thread_pool_executor.rst.
>> Creating file target/docs/source/apache_beam.utils.timestamp.rst.
>> Creating file target/docs/source/apache_beam.utils.urns.rst.
>> Creating file target/docs/source/apache_beam.utils.rst.
>> objc[8384]: +[__NSCFConstantString initialize] may have been in progress
>> in another thread when fork() was called.
>> objc[8384]: +[__NSCFConstantString initialize] may have been in progress
>> in another thread when fork() was called. We cannot safely call it or
>> ignore it in the fork() child process. Crashing instead. Set a breakpoint
>> on objc_initializeAfterForkError to debug.
>>
>> Traceback (most recent call last):
>>   File
>> "/Users/shoaib/Projects/beam/newbeam/sdks/python/test-suites/tox/pycommon/build/srcs/sdks/python/target/.tox-py37-docs/py37-docs/lib/python3.7/site-packages/sphinx/cmd/build.py",
>> line 304, in build_main
>> app.build(args.force_all, filenames)
>>   File
>> "/Users/shoaib/Projects/beam/newbeam/sdks/python/test-suites/tox/pycommon/build/srcs/sdks/python/target/.tox-py37-docs/py37-docs/lib/python3.7/site-packages/sphinx/application.py",
>> line 335, in build
>> self.builder.build_all()
>>   File
>> "/Users/shoaib/Projects/beam/newbeam/sdks/python/test-suites/tox/pycommon/build/srcs/sdks/python/target/.tox-py37-docs/py37-docs/lib/python3.7/site-packages/sphinx/builders/__init__.py",
>> line 305, in build_all
>> self.build(None, summary=__('all source files'), method='all')
>>   File
>> "/Users/shoaib/Projects/beam/newbeam/sdks/python/test-suites/tox/pycommon/build/srcs/sdks/python/target/.tox-py37-docs/py37-docs/lib/python3.7/site-packages/sphinx/builders/__init__.py",
>> line 360, in build
>> updated_docnames = set(self.read())
>>   File
>> "/Users/shoaib/Projects/beam/newbeam/sdks/python/test-suites/tox/pycommon/build/srcs/sdks/python/target/.tox-py37-docs/py37-docs/lib/python3.7/site-packages/sphinx/builders/__init__.py",
>> line 466, in read
>> self._read_parallel(docnames, nproc=self.app.parallel)
>>   File
>> "/Users/shoaib/Projects/beam/newbeam/sdks/python/test-suites/tox/pycommon/build/srcs/sdks/python/target/.tox-py37-docs/py37-docs/lib/python3.7/site-packages/sphinx/builders/__init__.py",
>> line 521, in _read_parallel
>> tasks.join()
>>   File
>> "/Users/shoaib/Projects/beam/newbeam/sdks/python/test-suites/tox/pycommon/build/srcs/sdks/python/target/.tox-py37-docs/py37-docs/lib/python3.7/site-packages/sphinx/util/parallel.py",
>> line 114, in join
>> self._join_one()
>>   File
>> "/Users/shoaib/Projects/beam/newbeam/sdks/python/test-suites/tox/pycommon/build/srcs/sdks/python/target/.tox-py37-docs/py37-docs/lib/python3.7/site-packages/sphinx/util/parallel.py",
>> line 120, in _join_one
>> exc, logs, result = pipe.recv()
>>   File
>> "/Users/shoaib/.pyenv/versions/3.7.0/lib/python3.7/multiprocessing/connection.py",
>> line 250, in recv
>> buf = self._recv_bytes()
>>   File
>> "/Users/shoaib/.pyenv/versions/3.7.0/lib/python3.7/multiprocessing/connection.py",
>> line 407, in _recv_bytes
>> buf = self._recv(4)
>>   File
>> "/Users/shoaib/.pyenv/versions/3.7.0/lib/python3.7/multiprocessing/connection.py",
>> line 383, in _recv
>> raise EOFError
>> EOFError
>>
>> Exception occurred:
>>   File
>> "/Users/shoaib/.pyenv/versions/3.7.0/lib/python3.7/multiprocessing/connection.py",
>> line 383, in _recv
>> raise EOFError
>> EOFError
>> The full traceback has been saved in
>> /Users/shoaib/Projects/beam/newbeam/sdks/python/test-suites/tox/pycommon/build/srcs/sdks/python/target/.tox-py37-docs/py37-docs/tmp/sphinx-err-mphtfnei.log,
>> if you want to report the issue to the developers.
>> Please also report this if it was a user error, so that a better error
>> message can be provided next time.
>> A bug report can be 

Re: Labels on PR

2020-02-11 Thread jincheng sun
I left comments on the PR; the main suggestion is that we may need a discussion
about what kind of labels should be added. I would like to share my thoughts
as follows:

I think we need to add labels according to some rules. For example, the
easiest way is to add labels by language: java / python / go, etc. But that
kind of label is of limited help, so we need to subdivide some labels, such as
by component. Currently we have more than 70 components, and configuring a
label for every component seems cumbersome. So we should have some rules for
dividing labels that keep them useful without becoming too cumbersome. For
example:

We can add `extensions`, or `extensions-ideas` and `extensions-java`, for the
following components:

- extensions-ideas
- extensions-java-join-library
- extensions-java-json
- extensions-java-protobuf
- extensions-java-sketching
- extensions-java-sorter

And it's better to add a label for each Runner as follows:

- runner-apex
- runner-core
- runner-dataflow
- runner-direct
- runner-flink
- runner-jstorm
- runner-...

So, I think it would be great to collect feedback from the community on the
set of labels needed.

What do you think?

Best,
Jincheng

Alex Van Boxel  于2020年2月11日周二 下午3:11写道:

> I've opened a PR and a ticket with INFRA.
>
> PR: https://github.com/apache/beam/pull/10824
>
>  _/
> _/ Alex Van Boxel
>
>
> On Tue, Feb 11, 2020 at 6:57 AM jincheng sun 
> wrote:
>
>> +1. Autolabeler seems really cool and it seems that it's simple to
>> configure and set up.
>>
>> Best,
>> Jincheng
>>
>>
>>
>> Udi Meiri  于2020年2月11日周二 上午2:01写道:
>>
>>> Cool!
>>>
>>> On Mon, Feb 10, 2020 at 9:27 AM Robert Burke  wrote:
>>>
>>>> +1 to autolabeling
>>>>
>>>> On Mon, Feb 10, 2020, 9:21 AM Luke Cwik  wrote:
>>>>
>>>>> Nice
>>>>>
>>>>> On Mon, Feb 10, 2020 at 2:52 AM Alex Van Boxel 
>>>>> wrote:
>>>>>
>>>>>> Ha, cool. I'll have a look at the autolabeler. The infra stuff is not
>>>>>> something I've looked at... I'll dive into that.
>>>>>>
>>>>>>  _/
>>>>>> _/ Alex Van Boxel
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 10, 2020 at 11:49 AM Ismaël Mejía 
>>>>>> wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> You don't need to write your own action, there is already one
>>>>>>> autolabeler action [1].
>>>>>>> INFRA can easily configure it for Beam (as they did for Avro
>>>>>>> [2]) if we request it.
>>>>>>> The plugin is quite easy to configure and works like a charm [3].
>>>>>>>
>>>>>>> [1] https://github.com/probot/autolabeler
>>>>>>> [2] https://issues.apache.org/jira/browse/INFRA-17367
>>>>>>> [3]
>>>>>>> https://github.com/apache/avro/blob/master/.github/autolabeler.yml
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Feb 10, 2020 at 11:20 AM Alexey Romanenko <
>>>>>>> aromanenko@gmail.com> wrote:
>>>>>>>
>>>>>>>> Great initiative, thanks Alex! I was thinking to add such labels
>>>>>>>> into PR title but I believe that GitHub labels are better since it can 
>>>>>>>> be
>>>>>>>> used easily for filtering, for example.
>>>>>>>>
>>>>>>>> Maybe it could be useful to add more granulation for labels, like
>>>>>>>> “release”, “runners”, “website”, etc but I’m afraid to make the titles 
>>>>>>>> too
>>>>>>>> heavy because of this.
>>>>>>>>
>>>>>>>> > On 10 Feb 2020, at 08:35, Alex Van Boxel 
>>>>>>>> wrote:
>>>>>>>> >
>>>>>>>> > I've started putting labels on PRs. I've done the first page for
>>>>>>>> now (as I'm afraid putting them on older ones could affect the stale
>>>>>>>> bot). I
>>>>>>>> hope this is OK.
>>>>>>>> >
>>>>>>>> > For now I'm only focussing on language, and I'm going to see if I
>>>>>>>> can write a GitHub Action for it. I hope this is useful. Other kinds of
>>>>>>>> suggestions for labels, that can be automated, are welcome.
>>>>>>>> >
>>>>>>>> > 
>>>>>>>> >  _/
>>>>>>>> > _/ Alex Van Boxel
>>>>>>>>
>>>>>>>>


Re: Labels on PR

2020-02-10 Thread jincheng sun
+1. Autolabeler seems really cool and it seems that it's simple to
configure and set up.

Best,
Jincheng



Udi Meiri  于2020年2月11日周二 上午2:01写道:

> Cool!
>
> On Mon, Feb 10, 2020 at 9:27 AM Robert Burke  wrote:
>
>> +1 to autolabeling
>>
>> On Mon, Feb 10, 2020, 9:21 AM Luke Cwik  wrote:
>>
>>> Nice
>>>
>>> On Mon, Feb 10, 2020 at 2:52 AM Alex Van Boxel  wrote:
>>>
 Ha, cool. I'll have a look at the autolabeler. The infra stuff is not
 something I've looked at... I'll dive into that.

  _/
 _/ Alex Van Boxel


 On Mon, Feb 10, 2020 at 11:49 AM Ismaël Mejía 
 wrote:

> +1
>
> You don't need to write your own action, there is already one
> autolabeler action [1].
> INFRA can easily configure it for Beam (as they did for Avro [2]) if
> we request it.
> The plugin is quite easy to configure and works like a charm [3].
>
> [1] https://github.com/probot/autolabeler
> [2] https://issues.apache.org/jira/browse/INFRA-17367
> [3] https://github.com/apache/avro/blob/master/.github/autolabeler.yml
>
>
> On Mon, Feb 10, 2020 at 11:20 AM Alexey Romanenko <
> aromanenko@gmail.com> wrote:
>
>> Great initiative, thanks Alex! I was thinking to add such labels into
>> PR title but I believe that GitHub labels are better since it can be used
>> easily for filtering, for example.
>>
>> Maybe it could be useful to add more granulation for labels, like
>> “release”, “runners”, “website”, etc but I’m afraid to make the titles 
>> too
>> heavy because of this.
>>
>> > On 10 Feb 2020, at 08:35, Alex Van Boxel  wrote:
>> >
>> > I've started putting labels on PRs. I've done the first page for
>> now (as I'm afraid putting them on older ones could affect the stale
>> bot). I
>> hope this is OK.
>> >
>> > For now I'm only focussing on language, and I'm going to see if I
>> can write a GitHub Action for it. I hope this is useful. Other kinds of
>> suggestions for labels, that can be automated, are welcome.
>> >
>> > 
>> >  _/
>> > _/ Alex Van Boxel
>>
>>


Re: [ANNOUNCE] New committer: Michał Walenia

2020-01-28 Thread jincheng sun
Congrats Michał !


Hannah Jiang 于2020年1月29日 周三01:43写道:

> Congrats you Michal!
>
>
>
> On Tue, Jan 28, 2020 at 9:11 AM Gleb Kanterov  wrote:
>
>> Congratulations!
>>
>> On Tue, Jan 28, 2020 at 6:03 PM Łukasz Gajowy  wrote:
>>
>>> Congratulations Michał! 
>>>
>>> wt., 28 sty 2020 o 16:33 Ryan Skraba  napisał(a):
>>>
 Congratulations!

 On Tue, Jan 28, 2020 at 11:26 AM Jan Lukavský  wrote:

> Congrats Michał!
> On 1/28/20 11:16 AM, Katarzyna Kucharczyk wrote:
>
> Congratulations Michał!  
>
> On Tue, Jan 28, 2020 at 9:29 AM Alexey Romanenko <
> aromanenko@gmail.com> wrote:
>
>> Congrats, Michał!
>>
>> On 28 Jan 2020, at 09:20, Ismaël Mejía  wrote:
>>
>> Congratulations Michał, well deserved!
>>
>> On Tue, Jan 28, 2020 at 8:54 AM Kamil Wasilewski <
>> kamil.wasilew...@polidea.com> wrote:
>>
>>> Congrats, Michał!
>>>
>>> On Tue, Jan 28, 2020 at 3:03 AM Udi Meiri  wrote:
>>>
 Congratulations Michał!

 On Mon, Jan 27, 2020 at 3:49 PM Chamikara Jayalath <
 chamik...@google.com> wrote:

> Congrats Michał!
>
> On Mon, Jan 27, 2020 at 2:59 PM Reza Rokni  wrote:
>
>> Congratulations buddy!
>>
>> On Tue, 28 Jan 2020, 06:52 Valentyn Tymofieiev, <
>> valen...@google.com> wrote:
>>
>>> Congratulations, Michał!
>>>
>>> On Mon, Jan 27, 2020 at 2:24 PM Austin Bennett <
>>> whatwouldausti...@gmail.com> wrote:
>>>
 Nice -- keep up the good work!

 On Mon, Jan 27, 2020 at 2:02 PM Mikhail Gryzykhin <
 mig...@google.com> wrote:
 >
 > Congratulations Michal!
 >
 > --Mikhail
 >
 > On Mon, Jan 27, 2020 at 1:01 PM Kyle Weaver <
 kcwea...@google.com> wrote:
 >>
 >> Congratulations Michał! Looking forward to your future
 contributions :)
 >>
 >> Thanks,
 >> Kyle
 >>
 >> On Mon, Jan 27, 2020 at 12:47 PM Pablo Estrada <
 pabl...@google.com> wrote:
 >>>
 >>> Hi everyone,
 >>>
 >>> Please join me and the rest of the Beam PMC in welcoming a
 new committer: Michał Walenia
 >>>
 >>> Michał has contributed to Beam in many ways, including the
 performance testing infrastructure, and has even spoken at events 
 about
 Beam.
 >>>
 >>> In consideration of his contributions, the Beam PMC trusts
 him with the responsibilities of a Beam committer[1].
 >>>
 >>> Thanks for your contributions Michał!
 >>>
 >>> Pablo, on behalf of the Apache Beam PMC.
 >>>
 >>> [1]
 https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer

>>>
>> --

Best,
Jincheng
-
Twitter: https://twitter.com/sunjincheng121
-


Re: [ANNOUNCE] New committer: Hannah Jiang

2020-01-28 Thread jincheng sun
Congrats Hannah!

Ankur Goenka 于2020年1月29日 周三11:35写道:

> Congrats Hannah!
>
> On Tue, Jan 28, 2020 at 7:30 PM Reza Rokni  wrote:
>
>> Congratz!
>>
>> On Wed, 29 Jan 2020 at 09:52, Valentyn Tymofieiev 
>> wrote:
>>
>>> Congratulations, Hannah!
>>>
>>> On Tue, Jan 28, 2020 at 5:46 PM Udi Meiri  wrote:
>>>
 Welcome and congrats Hannah!

 On Tue, Jan 28, 2020 at 4:52 PM Robin Qiu  wrote:

> Congratulations, Hannah!
>
> On Tue, Jan 28, 2020 at 4:50 PM Alan Myrvold 
> wrote:
>
>> Congrats, Hannah
>>
>> On Tue, Jan 28, 2020 at 4:46 PM Connell O'Callaghan <
>> conne...@google.com> wrote:
>>
>>> Thank you for sharing Luke!!!
>>>
>>> Well done and congratulations Hannah!!
>>>
>>> On Tue, Jan 28, 2020 at 4:45 PM Heejong Lee 
>>> wrote:
>>>
 Congratulations! :)

 On Tue, Jan 28, 2020 at 4:43 PM Yichi Zhang 
 wrote:

> Congrats Hannah!
>
> On Tue, Jan 28, 2020 at 3:57 PM Yifan Zou 
> wrote:
>
>> Congratulations Hannah!!
>>
>> On Tue, Jan 28, 2020 at 3:55 PM Boyuan Zhang 
>> wrote:
>>
>>> Thanks for all your contributions! Congratulations~
>>>
>>> On Tue, Jan 28, 2020 at 3:44 PM Pablo Estrada <
>>> pabl...@google.com> wrote:
>>>
 yoooho : D

 On Tue, Jan 28, 2020 at 3:21 PM Luke Cwik 
 wrote:

> Hi everyone,
>
> Please join me and the rest of the Beam PMC in welcoming a new
> committer: Hannah Jiang
>
> Hannah has contributed to Beam in many ways, including work on
> building and releasing the Apache Beam SDK containers.
>
> In consideration of their contributions, the Beam PMC trusts
> them with the responsibilities of a Beam committer[1].
>
> Thanks for your contributions Hannah!
>
> Luke, on behalf of the Apache Beam PMC.
>
> [1]
> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>

>>
>> --
>>
>> This email may be confidential and privileged. If you received this
>> communication by mistake, please don't forward it to anyone else, please
>> erase all copies and attachments, and please let me know that it has gone
>> to the wrong person.
>>
>> The above terms reflect a potential business arrangement, are provided
>> solely as a basis for further discussion, and are not intended to be and do
>> not constitute a legally binding obligation. No legally binding obligations
>> will be created, implied, or inferred until an agreement in final form is
>> executed in writing by all parties involved.
>>
> --

Best,
Jincheng
-
Twitter: https://twitter.com/sunjincheng121
-


Re: Beam's Avro 1.8.x dependency

2020-01-15 Thread jincheng sun
I found that several dependencies are shaded and planned to be made into
vendored artifacts in [1]. I'm not sure why Avro was not shaded before. From
my point of view, it's a good idea to shade Avro and make it a vendored
artifact if there are no special reasons blocking us from doing that. Regarding
how to create a vendored artifact, you can refer to [2] for more
details.

Best,
Jincheng

[1] https://issues.apache.org/jira/browse/BEAM-5819
[2] https://github.com/apache/beam/blob/master/vendor/README.md


Tomo Suzuki  于2020年1月16日周四 下午1:18写道:

> I've been upgrading dependencies around gRPC. This Avro-problem is
> interesting to me.
> I'll study BEAM-8388 more tomorrow.
>
> On Wed, Jan 15, 2020 at 10:51 PM Luke Cwik  wrote:
> >
> > +Tomo Suzuki +jincheng sun
> > There have been a few contributors upgrading the dependencies and
> validating things not breaking by running the majority of the post commit
> integration tests and also using the linkage checker to show that we aren't
> worse off with respect to our dependency tree. Reaching out to them to help
> you is your best bet of getting these upgrades through.
> >
> > On Wed, Jan 15, 2020 at 6:52 PM Aaron Dixon  wrote:
> >>
> >> I meant to mention that we must use Avro 1.9.x as we rely on some
> schema resolution fixes not present in 1.8.x - so am indeed blocked.
> >>
> >> On Wed, Jan 15, 2020 at 8:50 PM Aaron Dixon  wrote:
> >>>
> >>> It looks like Avro version dependency from Beam has come up in the
> past [1, 2].
> >>>
> >>> I'm currently on Beam 2.16.0, which has been compatible with my usage
> of Avro 1.9.x.
> >>>
> >>> But upgrading to Beam 2.17.0 is not possible for us now that 2.17.0
> has some dependencies on Avro classes only available in 1.8.x.
> >>>
> >>> Wondering if anyone else is similarly blocked and what it would take to
> prioritize Beam upgrading to 1.9.x or better using a shaded version so that
> clients can use their own Avro version for their own coding purposes. (Eg,
> I parse Avro messages from a KafkaIO source and need 1.9.x for this but am
> perfectly happy if Beam's Avro coding facilities used a shaded other
> version.)
> >>>
> >>> I've made a comment on BEAM-8388 [1] to this effect. But polling
> community for discussion.
> >>>
> >>> [1] https://issues.apache.org/jira/browse/BEAM-8388
> >>> [2] https://github.com/apache/beam/pull/9779
> >>>
>
>
> --
> Regards,
> Tomo
>


Re: [PROPOSAL] gRPC Vendor Release

2020-01-14 Thread jincheng sun
Thanks Luke, and good to see that the vendored dependencies have already
been released.

Best,
Jincheng


Luke Cwik  于2020年1月10日周五 上午2:09写道:

> Thanks for the work, I'll nominate myself as the release manager for
> beam-vendor-grpc-1_26_0.
>
> On Wed, Jan 8, 2020 at 5:38 PM jincheng sun 
> wrote:
>
>> Hi all,
>>
>> The PR [1] which bumps the version of gRPC to 1.26.0 has been merged.
>> It would be great to push forward the gRPC vendor release process now.
>>
>> And I would appreciate it if a committer could act as the release manager
>> to help with the release process.
>>
>> Best,
>> Jincheng
>>
>> [1] https://github.com/apache/beam/pull/10463
>>
>> jincheng sun  于2019年12月26日周四 下午3:53写道:
>>
>>> Correct the PR link, PR[3] is https://github.com/apache/beam/pull/10463
>>>
>>>
>>> jincheng sun  于2019年12月26日周四 下午3:49写道:
>>>
>>>> Hi folks,
>>>>
>>>> As mentioned in the problem [1] (more detail can be found in the
>>>> discussion [2]), we should run a separate gRPC vendored dependencies release
>>>> after the PR [3] has been merged.
>>>>
>>>> I would like to propose a vendor release for gRPC. According to the vendored
>>>> artifacts release guide [4], we need a favor from a committer who would like
>>>> to be the release manager and help with the release process.
>>>>
>>>> We have to reach consensus before starting a release, and we are looking
>>>> forward to your feedback!
>>>>
>>>> Best,
>>>> Jincheng
>>>>
>>>> [1] https://issues.apache.org/jira/browse/BEAM-9030
>>>> [2]
>>>> https://lists.apache.org/thread.html/ef5b24766d94d3d389bee9c03e59003b9cf417c81cde50ede5856ad7%40%3Cdev.beam.apache.org%3E
>>>> [3] https://github.com/apache/beam/pull/10462
>>>> [4] https://s.apache.org/beam-release-vendored-artifacts
>>>>
>>>


Re: [ANNOUNCE] Beam 2.17.0 Released!

2020-01-10 Thread jincheng sun
Thank you Mikhail!

Yichi Zhang 于2020年1月11日 周六09:09写道:

> Thank you Mikhail!
>
> On Fri, Jan 10, 2020 at 12:52 PM Ahmet Altay  wrote:
>
>> Thank you Mikhail!
>>
>> On Fri, Jan 10, 2020 at 12:40 PM Kyle Weaver  wrote:
>>
>>> Hooray! Thanks to Mikhail and everyone else who contributed.
>>>
>>> On Fri, Jan 10, 2020 at 10:23 AM Maximilian Michels 
>>> wrote:
>>>
 At last :) Thank you for making it happen Mikhail! Also thanks to
 everyone else who tested the release candidate.

 Cheers,
 Max

 On 10.01.20 19:01, Mikhail Gryzykhin wrote:
 > The Apache Beam team is pleased to announce the release of version
 2.17.0.
 >
 > Apache Beam is an open source unified programming model to define and
 > execute data processing pipelines, including ETL, batch and stream
 > (continuous) processing. See https://beam.apache.org
 > 
 >
 > You can download the release here:
 >
 > https://beam.apache.org/get-started/downloads/
 >
 > This release includes bug fixes, features, and improvements detailed
 on
 > the Beam blog:
 https://beam.apache.org/blog/2020/01/06/beam-2.17.0.html
 > 
 >
 > Thanks to everyone who contributed to this release, and we hope you
 > enjoy using Beam 2.17.0.

>>> --

Best,
Jincheng
-
Twitter: https://twitter.com/sunjincheng121
-


Re: [VOTE] Vendored Dependencies Release

2020-01-09 Thread jincheng sun
+1, checked the list as follows:
 - verified the hash and signature
 - verified that there are no linkage errors
 - verified that the content of the pom is as expected: the shaded
dependencies are not exposed, the scope of the logging dependencies is
runtime, etc.

Best,
Jincheng
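
(For completeness, a minimal sketch of the hash check above; the artifact
name is an assumption for illustration, not the exact staged file name, and
the signature check itself would still be done with gpg --verify.)

    # sketch only: verify a downloaded artifact against its .sha512 file
    import hashlib

    artifact = "beam-vendor-grpc-1_26_0-0.1.jar"          # assumed file name
    expected = open(artifact + ".sha512").read().split()[0]

    digest = hashlib.sha512()
    with open(artifact, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)

    print("sha512 OK" if digest.hexdigest() == expected else "sha512 MISMATCH")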

Kenneth Knowles 于2020年1月10日 周五12:29写道:

> +1
>
> On Thu, Jan 9, 2020 at 4:03 PM Ahmet Altay  wrote:
>
>> +1
>>
>> On Thu, Jan 9, 2020 at 2:04 PM Pablo Estrada  wrote:
>>
>>> +1
>>>
>>> verified sha1 and md5 hashes.
>>>
>>> On Thu, Jan 9, 2020 at 10:28 AM Luke Cwik  wrote:
>>>
 +1

 I validated that no classes appeared outside of the
 org.apache.beam.vendor.grpc.v1p26p0 namespace and I also validated that the
 linkage checker listed no potential linkage errors.

 On Thu, Jan 9, 2020 at 10:25 AM Luke Cwik  wrote:

> Please review the release of the following artifacts that we vendor:
>  * beam-vendor-grpc-1_26_0
>
> Hi everyone,
> Please review and vote on the release candidate #1 for the version
> 0.1, as follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
>
> The complete staging area is available for your review, which includes:
> * the official Apache source release to be deployed to dist.apache.org
> [1], which is signed with the key with
> fingerprint EAD5DE293F4A03DD2E77565589E68A56E371CCA2 [2],
> * all artifacts to be deployed to the Maven Central Repository [3],
> * commit hash "e60d49bdf1ed85e8f3efa1da784227f381a9e085" [4],
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
> Thanks,
> Release Manager
>
> [1] https://dist.apache.org/repos/dist/dev/beam/vendor/
> [2] https://dist.apache.org/repos/dist/release/beam/KEYS
> [3]
> https://repository.apache.org/content/repositories/orgapachebeam-1089/
> [4]
> https://github.com/apache/beam/commit/e60d49bdf1ed85e8f3efa1da784227f381a9e085
>
 --

Best,
Jincheng
-
Twitter: https://twitter.com/sunjincheng121
-


Re: [PROPOSAL] gRPC Vendor Release

2020-01-08 Thread jincheng sun
Hi all,

The PR [1] which bumps the version of gRPC to 1.26.0 has been merged.
It would be great to push forward the gRPC vendor release process now.

And I would appreciate it if a committer could act as the release manager to
help with the release process.

Best,
Jincheng

[1] https://github.com/apache/beam/pull/10463

jincheng sun  于2019年12月26日周四 下午3:53写道:

> Correct the PR link, PR[3] is https://github.com/apache/beam/pull/10463
>
>
> jincheng sun  于2019年12月26日周四 下午3:49写道:
>
>> Hi folks,
>>
>> As mentioned in the problem [1] (more detail can be found in the
>> discussion [2]), we should run a separate gRPC vendored dependencies release
>> after the PR [3] has been merged.
>>
>> I would like to propose a vendor release for gRPC. According to the vendored
>> artifacts release guide [4], we need a favor from a committer who would like
>> to be the release manager and help with the release process.
>>
>> We have to reach consensus before starting a release, and we are looking
>> forward to your feedback!
>>
>> Best,
>> Jincheng
>>
>> [1] https://issues.apache.org/jira/browse/BEAM-9030
>> [2]
>> https://lists.apache.org/thread.html/ef5b24766d94d3d389bee9c03e59003b9cf417c81cde50ede5856ad7%40%3Cdev.beam.apache.org%3E
>> [3] https://github.com/apache/beam/pull/10462
>> [4] https://s.apache.org/beam-release-vendored-artifacts
>>
>


Re: [DISCUSS] Bump the version of GRPC from 1.21.0 to 1.22.0+ (May be the latest 1.26.0?)

2019-12-27 Thread jincheng sun
Hi Udi,

Thanks for your confirmation. I have opened the PR [1], which
addresses all of the issues mentioned in BEAM-4938. I'd appreciate it if you
have time to take a look; any comments are welcome :)

Best,
Jincheng

[1] https://github.com/apache/beam/pull/10463


Udi Meiri  于2019年12月27日周五 上午2:31写道:

> Probably best to take one of these existing update bugs:
> https://issues.apache.org/jira/browse/BEAM-4938
> There's no discussion on these bugs, so I'd go with 1.26.0 unless someone
> else has an objection.
>
> On Mon, Dec 23, 2019 at 10:04 PM jincheng sun 
> wrote:
>
>> Hi folks,
>>
>> When submitting a Python word count job to a Flink session/standalone
>> cluster repeatedly, the metaspace usage of the task manager of the Flink
>> cluster continuously increases (about 40MB each time). The reason is
>> that the Beam classes are loaded with the user class loader (child-first by
>> default) in Flink, and there is a minor problem with the implementations of
>> `ProcessManager` (from Beam) and `ThreadPoolCache` (from Netty) which may
>> prevent the user class loader from being garbage collected even after the
>> job has finished, which eventually causes the metaspace memory leak. You can
>> refer to FLINK-15338 [1] for more information.
>>
>> Regarding `ProcessManager`, I have created JIRA BEAM-9006 [2] to
>> track it. Regarding `ThreadPoolCache`, it is a Netty problem and has
>> been fixed in NETTY#8955 [3]. Netty 4.1.35.Final already includes this
>> fix, and gRPC 1.22.0 already depends on Netty 4.1.35.Final. So we
>> need to bump the version of gRPC to 1.22.0+ (currently 1.21.0).
>>
>> My proposal is to upgrade the gRPC version to 1.22.0+ (maybe the
>> latest 1.26.0?)
>>
>> I've created JIRA [4], but I'm not sure if there will be any other
>> problems with bumping the version of gRPC. So, I'd like to bring up
>> this discussion, and welcome any feedback!
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-15338
>> [2] https://issues.apache.org/jira/browse/BEAM-9006
>> [3] https://github.com/netty/netty/pull/8955
>> [4] https://issues.apache.org/jira/browse/BEAM-9030
>>
>> Best,
>> Jincheng
>>
>


Re: [PROPOSAL] gRPC Vendor Release

2019-12-25 Thread jincheng sun
Correct the PR link, PR[3] is https://github.com/apache/beam/pull/10463


jincheng sun  于2019年12月26日周四 下午3:49写道:

> Hi folks,
>
> As mentioned in the problem [1] (more detail can be found in the
> discussion [2]), we should run a separate gRPC vendored dependencies release
> after the PR [3] has been merged.
>
> I would like to propose a vendor release for gRPC. According to the vendored
> artifacts release guide [4], we need a favor from a committer who would like
> to be the release manager and help with the release process.
>
> We have to reach consensus before starting a release, and we are looking
> forward to your feedback!
>
> Best,
> Jincheng
>
> [1] https://issues.apache.org/jira/browse/BEAM-9030
> [2]
> https://lists.apache.org/thread.html/ef5b24766d94d3d389bee9c03e59003b9cf417c81cde50ede5856ad7%40%3Cdev.beam.apache.org%3E
> [3] https://github.com/apache/beam/pull/10462
> [4] https://s.apache.org/beam-release-vendored-artifacts
>


[PROPOSAL] gRPC Vendor Release

2019-12-25 Thread jincheng sun
Hi folks,

As mentioned in the problem [1] (more detail can be found in the discussion
[2]), we should run a separate gRPC vendored dependencies release after the
PR [3] has been merged.

I would like to propose a vendor release for gRPC. According to the vendored
artifacts release guide [4], we need a favor from a committer who would like
to be the release manager and help with the release process.

We have to reach consensus before starting a release, and we are looking
forward to your feedback!

Best,
Jincheng

[1] https://issues.apache.org/jira/browse/BEAM-9030
[2]
https://lists.apache.org/thread.html/ef5b24766d94d3d389bee9c03e59003b9cf417c81cde50ede5856ad7%40%3Cdev.beam.apache.org%3E
[3] https://github.com/apache/beam/pull/10462
[4] https://s.apache.org/beam-release-vendored-artifacts


[DISCUSS] Bump the version of GRPC from 1.21.0 to 1.22.0+ (May be the latest 1.26.0?)

2019-12-23 Thread jincheng sun
Hi folks,

When submitting a Python word count job to a Flink session/standalone
cluster repeatedly, the metaspace usage of the task manager of the Flink
cluster continuously increases (about 40MB each time). The reason is
that the Beam classes are loaded with the user class loader (child-first by
default) in Flink, and there is a minor problem with the implementations of
`ProcessManager` (from Beam) and `ThreadPoolCache` (from Netty) which may
prevent the user class loader from being garbage collected even after the
job has finished, which eventually causes the metaspace memory leak. You can
refer to FLINK-15338 [1] for more information.

Regarding `ProcessManager`, I have created JIRA BEAM-9006 [2] to track it.
Regarding `ThreadPoolCache`, it is a Netty problem and has been fixed in
NETTY#8955 [3]. Netty 4.1.35.Final already includes this fix, and gRPC 1.22.0
already depends on Netty 4.1.35.Final. So we need to bump the version of gRPC
to 1.22.0+ (currently 1.21.0).

My proposal is to upgrade the gRPC version to 1.22.0+ (maybe the
latest 1.26.0?)

I've created JIRA [4], but I'm not sure if there will be any other problems
with bumping the version of gRPC. So, I'd like to bring up this
discussion, and welcome any feedback!

[1] https://issues.apache.org/jira/browse/FLINK-15338
[2] https://issues.apache.org/jira/browse/BEAM-9006
[3] https://github.com/netty/netty/pull/8955
[4] https://issues.apache.org/jira/browse/BEAM-9030

Best,
Jincheng


Re: [PROPOSAL] python precommit timeouts

2019-12-23 Thread jincheng sun
Big +1 on it.

Thanks a lot for the improvement, Udi!

It also makes sense to have this timeout for other tests, like Max said. I'm
wondering whether there are any timeout configurations for Jenkins; in that
case, the timeout config could be applied to all tests if necessary.

Best,
Jincheng
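
As a reference for the "configurable per-test" question in the quoted thread
below, here is a minimal sketch of how the pytest-timeout plugin is typically
used (the 600s/5s values are illustrative, not the numbers the proposal will
pick):

    # sketch only: module-level default plus a per-test override.
    # The project-wide default would normally live in the pytest config,
    # e.g. "timeout = 600" under [pytest] in pytest.ini or setup.cfg.
    import time
    import pytest

    # Default timeout for every test in this module.
    pytestmark = pytest.mark.timeout(600)

    def test_fast_path():
        assert 1 + 1 == 2

    @pytest.mark.timeout(5)   # per-test override for a known-slow spot
    def test_with_tight_timeout():
        time.sleep(1)
        assert True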


Robert Bradshaw  于2019年12月21日周六 上午7:21写道:

> On Fri, Dec 20, 2019 at 3:15 PM Udi Meiri  wrote:
> >
> > ITs will have a different timeout, but they're still not migrated to
> pytest so unaffected at the moment.
> >
> > So I created a PR and already seemed to find an issue. One test timed
> out while scanning the local filesystem.
> > It seems that it was scanning /tmp, which for apache-beam-jenkins-9 has
> 400k files (test output filenames are like:
> /tmp/tmpnv2uyqas.result-chars-0-of-1).
>
> Sounds like something that should be fixed (cleaning up Jenkins, fixing
> our tests to clean up after themselves, and avoiding expensive
> scans of /tmp).
>
> Overall huge +1 to setting per-test timeouts.
>
> > On Fri, Dec 20, 2019 at 9:41 AM Pablo Estrada 
> wrote:
> >>
> >> big +1!
> >>
> >> As Ahmet suggested, the IT-marked tests may need to have a different
> timeout. But other than that, I think this is great.
> >>
> >> On Fri, Dec 20, 2019 at 9:39 AM Udi Meiri  wrote:
> >>>
> >>> https://issues.apache.org/jira/browse/BEAM-9009
> >>>
> >>> On Fri, Dec 20, 2019 at 6:18 AM Maximilian Michels 
> wrote:
> 
>  +1 Good idea. We should also have this for Java if possible.
> 
>  On 20.12.19 02:59, Ahmet Altay wrote:
>  > This sounds reasonable. Would this be configurable per-test if
> needed?
>  >
>  > On Thu, Dec 19, 2019 at 5:52 PM Udi Meiri   > > wrote:
>  >
>  > Looking at this console log
>  > <
> https://builds.apache.org/job/beam_PreCommit_Python_Cron/lastCompletedBuild/timestamps/?time=HH:mm:ss=GMT-8=en_US
> >,
>  > it seems that some pytests got stuck (or slowed down
> considerably).
>  > I'd like to put a 10 minute default timeout on all unit tests,
> using
>  > the pytest-timeout 
> plugin.
>  >
>


Re: [VOTE] Release 2.17.0, release candidate #2

2019-12-18 Thread jincheng sun
Thanks for driving this release, Mikhail!

I found an incorrect release version in the release notes in
PR [1], and also left a question on PR [2].

But I do not think it's a blocker for the release :)

Best,
Jincheng

[1] https://github.com/apache/beam/pull/10401
[2] https://github.com/apache/beam/pull/10402


Ahmet Altay  于2019年12月19日周四 上午3:31写道:

> I validated Python quickstarts with Python 2. Wheel files are missing, but
> they work otherwise. Once the wheel files are added I will add my vote.
>
> On Wed, Dec 18, 2019 at 10:00 AM Luke Cwik  wrote:
>
>> I verified the release and ran the quickstarts and found that release
>> 2.16 broke Apache Nemo runner which is also an issue for 2.17.0 RC #2. It
>> is caused by a backwards incompatible change in ParDo.MultiOutput where
>> getSideInputs return value was changed from List to Map as part of
>> https://github.com/apache/beam/pull/9275. I filed
>> https://issues.apache.org/jira/browse/BEAM-8989 to track the issue.
>>
>> Should we re-add the method back in 2.17.0 renaming the newly added
>> method to something else and also patch 2.16.0 with a minor change
>> including the same fix (breaking 2.16.0 users who picked up the new method)
>> or leave as is?
>>
>
> I suggest not fixing this for 2.17, because the issue already exists in
> 2.16 and there are two releases in parallel and it would be fine to fix
> this for 2.18 or 2.19.
>
> +Reuven Lax , who merged the mentioned PR.
>
>
>>
>> On Tue, Dec 17, 2019 at 12:13 PM Mikhail Gryzykhin 
>> wrote:
>>
>>> Hi everyone,
>>>
>>>
>>> Please review and vote on the release candidate #2 for the version
>>> 2.17.0, as follows:
>>>
>>> [ ] +1, Approve the release
>>>
>>> [ ] -1, Do not approve the release (please provide specific comments)
>>>
>>>
>>> The complete staging area is available for your review, which includes:
>>>
>>> * JIRA release notes [1],
>>>
>>> * the official Apache source release to be deployed to dist.apache.org
>>> [2], which is signed with the key with fingerprint
>>> 53F72D4EEEF306D97736FE1065ABB07A8965E788
>>>
>>>  [3],
>>>
>>> * all artifacts to be deployed to the Maven Central Repository [4],
>>>
>>> * source code tag "v2.17.0-RC2" [5],
>>>
>>> * website pull request listing the release [6], publishing the API
>>> reference manual [7], and the blog post [8].
>>>
>>> * Python artifacts are deployed along with the source release to the
>>> dist.apache.org [2].
>>>
>>> * Validation sheet with a tab for 2.17.0 release to help with validation
>>> [9].
>>>
>>> * Docker images published to Docker Hub [10].
>>>
>>> The vote will be open for at least 72 hours. It is adopted by majority
>>> approval, with at least 3 PMC affirmative votes.
>>>
>>> Thanks,
>>>
>>> --Mikhail
>>>
>>> [1]
>>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12345970=12319527
>>>
>>> [2] https://dist.apache.org/repos/dist/dev/beam/2.17.0/
>>>
>>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>>
>>> [4]
>>> https://repository.apache.org/content/repositories/orgapachebeam-1087/
>>>
>>> [5] https://github.com/apache/beam/tree/v2.17.0-RC2
>>>
>>> [6] https://github.com/apache/beam/pull/10401
>>>
>>> [7] https://github.com/apache/beam-site/pull/594
>>>
>>> [8] https://github.com/apache/beam/pull/10402
>>>
>>> [9]
>>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=803858785
>>>
>>> [10] https://hub.docker.com/u/apachebeam
>>>
>>>


Re: [VOTE] Beam's Mascot will be the Firefly (Lampyridae)

2019-12-13 Thread jincheng sun
+1 (non-binding)

Alex Van Boxel wrote on Fri, Dec 13, 2019 at 16:21:

> +1
>
> On Fri, Dec 13, 2019, 05:58 Kenneth Knowles  wrote:
>
>> Please vote on the proposal for Beam's mascot to be the Firefly. This
>> encompasses the Lampyridae family of insects, without specifying a genus or
>> species.
>>
>> [ ] +1, Approve Firefly being the mascot
>> [ ] -1, Disapprove Firefly being the mascot
>>
>> The vote will be open for at least 72 hours excluding weekends. It is
>> adopted by at least 3 PMC +1 approval votes, with no PMC -1 disapproval
>> votes*. Non-PMC votes are still encouraged.
>>
>> PMC voters, please help by indicating your vote as "(binding)"
>>
>> Kenn
>>
>> *I have chosen this format for this vote, even though Beam uses simple
>> majority as a rule, because I want any PMC member to be able to veto based
>> on concerns about overlap or trademark.
>>
> --

Best,
Jincheng
-
Committer & PMC Member at @ApacheFlink
Staff Engineer at @Alibaba
Blog: https://enjoyment.cool
Twitter: https://twitter.com/sunjincheng121
--


Re: Links to Java API docs in Beam Website documentation (Was: Version Beam Website Documentation)

2019-12-11 Thread jincheng sun
+1 for using {{site.release_latest}}, which makes more sense to me.

Best,
Jincheng

Kenneth Knowles wrote on Wed, Dec 11, 2019 at 1:12 PM:

> +1 to site.release_latest
>
> We do have a dead link checker in the website tests. Does it not catch
> moved classes, etc?
>
> On Tue, Dec 10, 2019 at 1:49 PM Pablo Estrada  wrote:
>
>> +1 to rely on expanding {{site.release_latest}}.
>>
>> On Tue, Dec 10, 2019 at 12:05 PM Brian Hulette 
>> wrote:
>>
>>> I was thinking about this recently as well. I requested we add a link to
>>> the java API docs in a website change [1]. I searched around a bit to look
>>> for precedent on how to do this, but I found three different methods:
>>> - Links to a specific version (e.g.
>>> https://beam.apache.org/releases/javadoc/2.0.0/...)
>>> - Links to "current" (e.g.
>>> https://beam.apache.org/releases/javadoc/current/...)
>>> - Links that rely on expanding site.relase_latest (i.e.
>>> https://beam.apache.org/releases/javadoc/{{ site.release_latest}}/...)
>>>
>>> The first seems clearly bad if we want to always use the most recent
>>> version, but it does have the benefit that we don't have to worry about the
>>> links breaking after code changes.
>>>
>>> The latter two are effectively the same, but site.release_latest has the
>>> benefit that its parameterized so we _could_ generate documentation for
>>> other versions if we want to. It also seems to be the most prevalent. So I
>>> think that's the way to go. Are there any objections to updating all of our
>>> links to use site.release_latest?
>>>
>>> I think the only possible concern is we might break a link if we
>>> move/rename a class. It would be nice if there were some way to validate
>>> them.
>>>
>>> Brian
>>>
>>> [1] https://github.com/apache/beam/pull/10273#discussion_r354533080
>>>
>>>
>>> On Fri, Dec 6, 2019 at 7:20 AM Maximilian Michels 
>>> wrote:
>>>
 @Kenn This is not only about breaking changes. We can also add new
 features or settings which will then be advertised in the documentation
 but not be available in older versions.

 Having a single source of truth is easier to maintain and better
 discoverable via search engines. However, it forces us to use wordings
 like "Works like this in Beam version <= X.Y, otherwise use..". The
 pragmatic approach there is to just ignore old Beam versions. That's
 not
 super user friendly, but it works.

 IMHO the amount of version-specific content in the Beam documentation
 probably does not yet justify forking the documentation for every
 release.

 Cheers,
 Max

 On 06.12.19 08:13, Alex Van Boxel wrote:
 > It seems also be too complex for the Google Crawler as well. A lot of
 > times I arrived on documentation on an older version of a product
 when I
 > search (aka Google) for something.
 >
 >   _/
 > _/ Alex Van Boxel
 >
 >
 > On Fri, Dec 6, 2019 at 6:20 AM Kenneth Knowles >>> > > wrote:
 >
 > Since we are not making breaking changes (we hope) and we try to
 be
 > careful about performance regressions, I think it is OK to simply
 > encourage users to upgrade to the latest if they expect the
 > narrative documentation to match their version. The versioned API
 > docs are probably enough. We might consider putting more info into
 > the javadocs / pydocs to bridge the gap, if you have seen any
 issues
 > with users hitting trouble.
 >
 > I am saying this for two reasons:
 >
 >   - versioning the site is more work, and someone would need to do
 > that work
 >   - but more than that, versioned site is more complex for users
 >
 > Kenn
 >
 > On Wed, Dec 4, 2019 at 1:48 PM Ankur Goenka >>> > > wrote:
 >
 > I agree, having a single website showcase the latest beam
 > versions and encourages users to use the latest Beam version
 > which is very useful.
 > Calling out version limitations are definitely makes users
 life
 > easier.
 >
 > The usecase I have in mind is more on the lines of best
 > practices and recommended way of doing things.
 > One such example is the way we recommend new users to try
 > Portable Flink. We are overhauling and simplifying the user
 > onboarding experience. Though the old way of doing things are
 > still supported, the easier new recommendation for onboarding
 > will only apply from Beam 2.18.
 > We can ofcource create sections on documentation for this
 > usecase but it seems like a poor man's way of versioning :)
 >
 > You also highlighted a great usecase about LTS release. Should
 > we simply separate out the documentations for LTS release and
 > 
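
To make the three linking styles discussed above concrete, a small illustrative
snippet (the KafkaIO path is only an example; the last form, relying on Jekyll
expanding site.release_latest at build time, is the one proposed as the
convention):

    <!-- pinned to a specific release -->
    [KafkaIO](https://beam.apache.org/releases/javadoc/2.0.0/org/apache/beam/sdk/io/kafka/KafkaIO.html)
    <!-- always points at "current" -->
    [KafkaIO](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/kafka/KafkaIO.html)
    <!-- expanded by Jekyll at build time; the preferred form -->
    [KafkaIO](https://beam.apache.org/releases/javadoc/{{ site.release_latest }}/org/apache/beam/sdk/io/kafka/KafkaIO.html)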

Re: [PROPOSAL] Revised streaming extensions for Beam SQL

2019-12-11 Thread jincheng sun
Thanks for bringing up this discussion, Kenn!

Definitely +1 for the proposal.

I have left some questions in the documentation :)

Best,
Jincheng

Rui Wang wrote on Wed, Dec 11, 2019 at 5:23 AM:

> Until now as I am not seeing more people are commenting on this proposal,
> can we consider this proposal is already accepted by Beam community?
>
> If it is accepted, I want to start a discussion on deprecate the old GROUP
> BY windowing style and only keep table-valued function windowing.
>
>
> -Rui
>
> On Thu, Jul 25, 2019 at 11:32 AM Kenneth Knowles  wrote:
>
>> We hope it does enter the SQL standard. It is one reason for coming
>> together to write this paper.
>>
>> OVER clause is mentioned often.
>>
>>  - TUMBLE can actually just be a function so you don't need OVER or any
>> of the fancy stuff we propose; it is just done to make them all look similar
>>  - HOP still doesn't work since OVER clause has one value per input row,
>> it is still 1 to 1 input/output ratio
>>  - SESSION GAP 5 MINUTES (PARTITION BY key) is actually a natural syntax
>> that could work well
>>
>> None of them require ORDER, by design.
>>
>> On the other hand, implementing the general OVER clause and the rank,
>> running sum, etc, could be done with GBK + sort values. That is not related
>> to windowing. And since in SQL users of windowing will think of OVER as
>> related to ordering, I personally don't want to also use it for something
>> that has nothing to do with ordering.
>>
>> But if you would write up something that could be interesting to discuss
>> more.
>>
>> Kenn
>>
>> On Wed, Jul 24, 2019 at 2:24 PM Mingmin Xu  wrote:
>>
>>> +1 to remove those magic words in Calcite streaming SQL, just because
>>> they're not SQL standard. The idea to replace HOP/TUMBLE with
>>> table-view-functions makes it concise, my only question is, is it(or will
>>> it be) part of SQL standard? --I'm a big fan to align with standards :lol
>>>
>>> Ps, although the concept of `window` used here are different from window
>>> function in SQL, the syntax gives some insight. Take the example of 
>>> `ROW_NUMBER()
>>> OVER (PARTITION BY COL1 ORDER BY COL2) AS row_number`, `ROW_NUMBER()`
>>> assigns a sequence value for records in subgroup with key 'COL1'. We can
>>> introduce another function, like TUMBLE() which will assign a window
>>> instance(more instances for HOP()) for the record.
>>>
>>> Mingmin
>>>
>>>
>>> On Sun, Jul 21, 2019 at 9:42 PM Manu Zhang 
>>> wrote:
>>>
 Thanks Kenn,
 great paper and left some newbie questions on the proposal.

 Manu

 On Fri, Jul 19, 2019 at 1:51 AM Kenneth Knowles 
 wrote:

> Hi all,
>
> I recently had the great privilege to work with others from Beam plus
> Calcite and Flink SQL contributors to build a new and minimal proposal for
> adding streaming extensions to standard SQL: event time, watermarks,
> windowing, triggers, stream materialization.
>
> We hope this will influence the standard body and also Calcite and
> Flink and other projects working on the streaming SQL.
>
> I would like to start implementing these extensions in Beam, moving
> from our current streaming extensions to the new proposal.
>
>The whole paper is https://arxiv.org/abs/1905.12133
>
>My small proposal to start in Beam:
> https://s.apache.org/streaming-beam-sql
>
> TL;DR: replace `GROUP BY Tumble/Hop/Session` with table functions that
> do Tumble, Hop, Session. The details of why to make this change are
> explained in the appendix to my proposal. For the big picture of how it
> fits in, the full paper is best.
>
> Kenn
>

>>>
>>> --
>>> 
>>> Mingmin
>>>
>>
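
As a concrete illustration of the syntax change proposed in this thread (a
sketch only, assuming an events table with an event-time column ts; the exact
keywords follow the linked paper and proposal and may differ from what finally
lands in Calcite and Beam SQL):

    -- current Calcite-style streaming extension: the window hides inside GROUP BY
    SELECT user_id, COUNT(*)
    FROM events
    GROUP BY user_id, TUMBLE(ts, INTERVAL '1' HOUR);

    -- proposed: TUMBLE as a table-valued function adding window_start/window_end
    SELECT user_id, window_start, window_end, COUNT(*)
    FROM TABLE(TUMBLE(TABLE events, DESCRIPTOR(ts), INTERVAL '1' HOUR))
    GROUP BY user_id, window_start, window_end;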


Re: [DISCUSS] BIP reloaded

2019-12-10 Thread jincheng sun
Thanks for bringing up this discussion, Jan!

+1 for clearly defining the BIP process for Beam.

And I think it would be nice to start a concept document for BIPs. Just a
reminder: the document could cover:

- What kinds of improvements exist in Beam.
- What kind of improvement should require a BIP.
- What should be included in a BIP.
- Who can create a BIP.
- Who can participate in the discussion of a BIP, and who can vote on a BIP.
- What the possible constraints of a BIP are, such as whether it is necessary
to complete the development of a BIP in one release.
- How to track a BIP.

Here is a question: I found a policies page [1] for Beam, but it only contains
the release policy. Does Beam have something like bylaws, similar to Flink [2]?

Anyway, I like your proposal, Jan :)

Best,
Jincheng
[1] https://beam.apache.org/community/policies/
[2]
https://cwiki.apache.org/confluence/display/FLINK/Flink+Bylaws#FlinkBylaws-Approvals


David Morávek wrote on Tue, Dec 10, 2019 at 2:33 PM:

> Hi Jan,
>
> I think this is more pretty much what we currently do, just a little bit
> more transparent for the community. If the process is standardized, it can
> open doors for bigger contributions from people not familiar with the
> process. Also it's way easier to track progress of BIPs, than documents
> linked from the mailing list.
>
> Big +1 ;)
>
> D.
>
> On Sun, Dec 8, 2019 at 12:42 PM Jan Lukavský  wrote:
>
>> Hi,
>>
>> I'd like to revive a discussion that was taken some year and a half ago
>> [1], which included a concept of "BIP" (Beam Improvement Proposal) - an
>> equivalent of "FLIP" (flink), "KIP" (kafka), "SPIP" (spark), and so on.
>>
>> The discussion then ended without any (public) conclusion, so I'd like
>> to pick up from there. There were questions related to:
>>
>>   a) how does the concept of BIP differ from simple plain JIRA?
>>
>>   b) what does it bring to the community?
>>
>> I'd like to outline my point of view on both of these aspects (they are
>> related).
>>
>> BIP differs from JIRA by definition of a process:
>>
>> BIP -> vote -> consensus -> JIRA -> implementation
>>
>> This process (although it might seem a little unnecessary formal) brings
>> the following benefits:
>>
>>   i) improves community's overall awareness of planned and in-progress
>> features
>>
>>   ii) makes it possible to prioritize long-term goals (create "roadmap"
>> that was mentioned in the referred thread)
>>
>>   iii) by casting explicit vote on each improvement proposal diminishes
>> the probability of wasted work - as opposed to our current state, where
>> it is hard to tell when there is a consensus and what actions need to be
>> done in order to reach one if there isn't
>>
>>   iv) BIPs that eventually pass a vote can be regarded as "to be
>> included in some short term" and so new BIPs can build upon them,
>> without the risk of having to be redefined if their dependency for
>> whatever reason don't make it to the implementation
>>
>> Although this "process" might look rigid and corporate, it actually
>> brings better transparency and overall community health. This is
>> especially important as the community grows and becomes more and more
>> distributed. There are many, many open questions in this proposal that
>> need to be clarified, my current intent is to grab a grasp about how the
>> community feels about this.
>>
>> Looking forward to any comments,
>>
>>   Jan
>>
>> [1]
>>
>> https://lists.apache.org/thread.html/4e1fffa2fde8e750c6d769bf4335853ad05b360b8bd248ad119cc185%40%3Cdev.beam.apache.org%3E
>>
>>


Re: [RELEASE] Tracking 2.18

2019-12-05 Thread jincheng sun
Thanks for the tracking, Udi!

I have updated the status of some release blocker issues as follows:

- BEAM-8733 closed
- BEAM-8620 reset the fix version to 2.19
- BEAM-8618 reset the fix version to 2.19

Best,
Jincheng

Colm O hEigeartaigh wrote on Thu, Dec 5, 2019 at 5:38 PM:

> Could we get this one in 2.18 as well?
> https://issues.apache.org/jira/browse/BEAM-8861
>
> Colm.
>
> On Wed, Dec 4, 2019 at 8:02 PM Udi Meiri  wrote:
>
>> Following the release calendar, I plan on cutting the 2.18 release branch
>> today.
>>
>> There are currently 8 release blockers
>> .
>>
>>


Re: [PROPOSAL] Preparing for Beam 2.18 release

2019-11-30 Thread jincheng sun
+1
Thank you Udi!

Best,
Jincheng


Valentyn Tymofieiev wrote on Thu, Nov 28, 2019 at 08:09:

> +1. Thanks, Udi!
>

> On Wed, Nov 27, 2019 at 12:58 PM Ahmet Altay  wrote:
>
>> Thank you Udi for keeping the release cadence. +1 to cutting 2.18.0
>> branch on time.
>>
>> On Thu, Nov 21, 2019 at 10:07 AM Udi Meiri  wrote:
>>
>>> Thanks Cham. Tomo, if there are any dependencies you believe are
>>> blockers please mark them.
>>> Also, only the sub-tasks
>>> 
>>> seem to be real upgrade tickets (so only ~150).
>>>
>>> Back to the original intent of my message, do we have consensus on who
>>> will do the 2.18 release?
>>>
>>> On Wed, Nov 20, 2019 at 6:05 PM Tomo Suzuki  wrote:
>>>
 Thank you for response.

 On Wed, Nov 20, 2019 at 16:49 Chamikara Jayalath 
 wrote:

>
>
> On Wed, Nov 20, 2019 at 1:04 PM Tomo Suzuki 
> wrote:
>
>> Hi Udi,
>>
>> (Question) I started learning how Beam dependencies are maintained
>> through releases. https://beam.apache.org/contribute/dependencies/
>>  says
>>
>>
>> *Beam community has agreed on following policies regarding upgrading
>> dependencies.*
>>
>> ...
>>
>> *A significantly outdated dependency (identified manually or through
>> the automated Jenkins job) should result in a JIRA that is a blocker for
>> the next release. Release manager may choose to push the blocker to the
>> subsequent release or downgrade from a blocker.*
>>
>>
>> Is the statement above still valid? We have ~250 automatically
>> created tickets [1] for dependency upgrade.
>>
>
> I think it's up to the release manager as mentioned in the statement.
> We surely don't want to block releases by all these JIRAs but Beam
> community and/or release manager may decide to make some of these blockers
> if needed.
> I don't think the tool automatically makes the auto generated JIRAs
> release blockers.
>
> Thanks,
> Cham
>
>
>> [1]:
>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20dependencies%20ORDER%20BY%20updated%20DESC%2C%20priority%20DESC
>>
>> Regards,
>> Tomo
>>
>>
>> On Wed, Nov 20, 2019 at 3:48 PM Udi Meiri  wrote:
>>
>>> Hi all,
>>>
>>> The next (2.18) release branch cut is scheduled for Dec 4, according
>>> to the calendar
>>> 
>>> .
>>> I would like to volunteer myself to do this release.
>>> The plan is to cut the branch on that date, and cherrypick 
>>> release-blocking
>>> fixes afterwards if any.
>>>
>>> Any unresolved release blocking JIRA issues for 2.18 should have
>>> their "Fix Version/s" marked as "2.18.0".
>>>
>>> Any comments or objections?
>>>
>>>
>>
>> --
>> Regards,
>> Tomo
>>
> --
 Regards,
 Tomo

>>>


Re: [ANNOUNCE] New committer: Daniel Oliveira

2019-11-23 Thread jincheng sun
Congrats, Daniel!
Best,
Jincheng

Alexey Romanenko wrote on Fri, Nov 22, 2019 at 5:47 PM:

> Congratulations, Daniel!
>
> On 22 Nov 2019, at 09:18, Jan Lukavský  wrote:
>
> Congrats Daniel!
> On 11/21/19 10:11 AM, Gleb Kanterov wrote:
>
> Congratulations!
>
> On Thu, Nov 21, 2019 at 6:24 AM Thomas Weise  wrote:
>
>> Congratulations!
>>
>>
>> On Wed, Nov 20, 2019, 7:56 PM Chamikara Jayalath 
>> wrote:
>>
>>> Congrats!!
>>>
>>> On Wed, Nov 20, 2019 at 5:21 PM Daniel Oliveira 
>>> wrote:
>>>
 Thank you everyone! I won't let you down. o7

 On Wed, Nov 20, 2019 at 2:12 PM Ruoyun Huang  wrote:

> Congrats Daniel!
>
> On Wed, Nov 20, 2019 at 1:58 PM Robert Burke 
> wrote:
>
>> Congrats Daniel! Much deserved.
>>
>> On Wed, Nov 20, 2019, 12:49 PM Udi Meiri  wrote:
>>
>>> Congrats Daniel!
>>>
>>> On Wed, Nov 20, 2019 at 12:42 PM Kyle Weaver 
>>> wrote:
>>>
 Congrats Dan! Keep up the good work :)

 On Wed, Nov 20, 2019 at 12:41 PM Cyrus Maden 
 wrote:

> Congratulations! This is great news.
>
> On Wed, Nov 20, 2019 at 3:24 PM Rui Wang 
> wrote:
>
>> Congrats!
>>
>>
>> -Rui
>>
>> On Wed, Nov 20, 2019 at 11:48 AM Valentyn Tymofieiev <
>> valen...@google.com> wrote:
>>
>>> Congrats, Daniel!
>>>
>>> On Wed, Nov 20, 2019 at 11:47 AM Kenneth Knowles <
>>> k...@apache.org> wrote:
>>>
 Hi all,

 Please join me and the rest of the Beam PMC in welcoming a new
 committer: Daniel Oliveira

 Daniel introduced himself to dev@ over two years ago and has
 contributed in many ways since then. Daniel has contributed to 
 general
 project health, the portability framework, and all three 
 languages: Java,
 Python SDK, and Go. I would like to particularly highlight how he 
 deleted
 12k lines of dead reference runner code [1].

 In consideration of Daniel's contributions, the Beam PMC trusts
 him with the responsibilities of a Beam committer [2].

 Thank you, Daniel, for your contributions and looking forward
 to many more!

 Kenn, on behalf of the Apache Beam PMC

 [1] https://github.com/apache/beam/pull/8380
 [2]
 https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer

>>>
>
> --
> 
> Ruoyun  Huang
>
>
>


Re: [ANNOUNCE] New committer: Brian Hulette

2019-11-14 Thread jincheng sun
Congratulations, Brian!

Best,
Jincheng

Kyle Weaver wrote on Fri, Nov 15, 2019 at 7:19 AM:

> Thanks for your contributions and congrats Brian!
>
> On Thu, Nov 14, 2019 at 3:14 PM Kenneth Knowles  wrote:
>
>> Hi all,
>>
>> Please join me and the rest of the Beam PMC in welcoming a new committer:
>> Brian Hulette
>>
>> Brian introduced himself to dev@ earlier this year and has been
>> contributing since then. His contributions to Beam include explorations of
>> integration with Arrow, standardizing coders, portability for schemas, and
>> presentations at Beam events.
>>
>> In consideration of Brian's contributions, the Beam PMC trusts him with
>> the responsibilities of a Beam committer [1].
>>
>> Thank you, Brian, for your contributions and looking forward to many more!
>>
>> Kenn, on behalf of the Apache Beam PMC
>>
>> [1]
>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>
>


Re: Date/Time Ranges & Protobuf

2019-11-13 Thread jincheng sun
Thanks for bringing up this discussion @Luke.

As @Kenn mentioned, in Beam we have defined constant values for the
min/max/end of the global window. I noticed that
google.protobuf.Timestamp/Duration is only used in window definitions, such
as FixedWindowsPayload, SlidingWindowsPayload, SessionsPayload, etc.

I think that both RFC 3339 and Beam's current implementation are big enough
to express common window definitions. But users can indeed define a
window size that is outside the scope of RFC 3339. Conceptually, we should
not limit the time range for windows (although I think the range of RFC 3339
is big enough in most cases).

To make sure people know the background of the discussion, I hope you
don't mind that I put the original conversation thread [1] here.

Best,
Jincheng

[1] https://github.com/apache/beam/pull/10041#discussion_r344380809

Robert Bradshaw wrote on Tue, Nov 12, 2019 at 4:09 PM:

> I agree about it being a tagged union in the model (together with
> actual_time(...) - epsilon). It's not just a performance hack though,
> it's also (as discussed elsewhere) a question of being able to find an
> embedding into existing datetime libraries. The real question here is
> whether we should limit ourselves to just these 10000 years AD, or
> find value in being able to process events for the lifetime of the
> universe (or, at least, recorded human history). Artificially limiting
> in this way would seem surprising to me at least.
>
> On Mon, Nov 11, 2019 at 11:58 PM Kenneth Knowles  wrote:
> >
> > The max timestamp, min timestamp, and end of the global window are all
> performance hacks in my view. Timestamps in beam are really a tagged union:
> >
> > timestamp ::= min | max | end_of_global | actual_time(... some
> quantitative timestamp ...)
> >
> > with the ordering
> >
> > min < actual_time(...) < end_of_global < max
> >
> > We chose arbitrary numbers so that we could do simple numeric
> comparisons and arithmetic.
> >
> > Kenn
> >
> > On Mon, Nov 11, 2019 at 2:03 PM Luke Cwik  wrote:
> >>
> >> While crites@ was investigating using protobuf to represent Apache
> Beam timestamps within the TestStreamEvents, he found out that the well
> known type google.protobuf.Timestamp doesn't support certain timestamps we
> were using in our tests (specifically the max timestamp that Apache Beam
> supports).
> >>
> >> This led me to investigate, and the well known type
> google.protobuf.Timestamp supports dates/times from 0001-01-01T00:00:00Z to
> 9999-12-31T23:59:59.999999999Z, which is much smaller than the timestamp
> range that Apache Beam currently supports, -9223372036854775ms to
> 9223372036854775ms, which is about 292277BC to 294247AD (it was difficult to
> find a time range that represented this).
> >>
> >> Similarly the google.protobuf.Duration represents any time range over
> those ~10000 years. Google decided to limit their range to be compatible
> with the RFC 3339 [1] standard, which does simplify many things since it
> guarantees that all RFC 3339 time parsing/manipulation libraries are
> supported.
> >>
> >> Should we:
> >> A) define our own timestamp/duration types to be able to represent the
> full time range that Apache Beam can express?
> >> B) limit the valid timestamps in Apache Beam to some standard such as
> RFC 3339?
> >>
> >> This discussion is somewhat related to the efforts to support nano
> timestamps[2].
> >>
> >> 1: https://tools.ietf.org/html/rfc3339
> >> 2:
> https://lists.apache.org/thread.html/86a4dcabdaa1dd93c9a55d16ee51edcff6266eda05221acbf9cf666d@%3Cdev.beam.apache.org%3E
>
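
For anyone who wants to poke at the ranges being compared here, a small Python
sketch (the constants come from apache_beam.utils.timestamp and their exact
values may drift between SDK versions, so treat the comments as approximate):

    # Compare Beam's timestamp bounds with the RFC 3339 range used by
    # google.protobuf.Timestamp (0001-01-01T00:00:00Z .. 9999-12-31T23:59:59.999999999Z).
    from apache_beam.utils.timestamp import MAX_TIMESTAMP, MIN_TIMESTAMP

    # Both bounds fall far outside the ~10000-year RFC 3339 window, which is
    # exactly the mismatch discussed above.
    print(MIN_TIMESTAMP)
    print(MAX_TIMESTAMP)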


Re: Key encodings for state requests

2019-11-12 Thread jincheng sun
Thanks for sharing your thoughts, which helped me to better understand
the design of the FnAPI; it makes more sense to me now.

Many thanks, Robert!

Best,
Jincheng


Robert Bradshaw  于2019年11月12日周二 上午2:10写道:

> On Fri, Nov 8, 2019 at 10:04 PM jincheng sun 
> wrote:
> >
> > > Let us first define what are "standard coders". Usually it should be
> the coders defined in the Proto. However, personally I think the coders
> defined in the Java ModelCoders [1] seems more appropriate. The reason is
> that for a coder which has already appeared in Proto and still not added to
> the Java ModelCoders, it's always replaced by the runner with
> LengthPrefixCoder[ByteArrayCoder] and so the SDK harness will decode the
> data with LengthPrefixCoder[ByteArrayCoder] instead of the actual coder.
> That's to say, that coder does not still take effect in the SDK harness.
> Only when the coder is added in ModelCoders, it's 'known' and will take
> effect.
> >
> > Correct this point! The coder which is not contained in the Java
> ModelCoders is replaced with LengthPrefixCoder[ByteArrayCoder] at runner
> side and LengthPrefixCoder[CustomCoder] at SDK harness side.
> >
> > The point here is that the runner determines whether it knows the coder
> according to the coders defined in the Java ModelCoders, not the coders
> defined in the proto file. So if taking option 3, the non-standard coders
> which will be wrapped with LengthPrefixCoder should also be determined by
> the coders defined in the Java ModerCoders, not the coders defined in the
> proto file.
>
> Yes.
>
> Both as a matter of principle and pragmatics, it'd be good to avoid
> anything about the model only defined in Java files.
>
> Also, when we say "the runner" we cannot assume it's written in Java.
> While many Java OSS runners share these libraries, The Universal Local
> Runner is written in Python. Dataflow is written (primarily) in C++.
> My hope is that the FnAPI will be stable enough that one can even run
> multiple versions of the Java SDK with the same runner. What matters
> is that (1) if the same URN is used, all runners/SDKs agree on the
> encoding (2) there are certain coders (Windowed, LengthPrefixed, and
> KV come to mind) that all Runners/SDKs are required to understand, and
> (3) runners properly coerce coders they do not understand into coders
> that they do if they need to pull out and act on the bytes. The more
> coders the runner/SDK understands, the less often it needs to do this.
>
> > jincheng sun wrote on Sat, Nov 9, 2019 at 12:26:
> >>
> >> Hi Robert Bradshaw,
> >>
> >> Thanks a lot for the explanation. Very interesting topic!
> >>
> >> Let us first define what are "standard coders". Usually it should be
> the coders defined in the Proto. However, personally I think the coders
> defined in the Java ModelCoders [1] seems more appropriate. The reason is
> that for a coder which has already appeared in Proto and still not added to
> the Java ModelCoders, it's always replaced by the runner with
> LengthPrefixCoder[ByteArrayCoder] and so the SDK harness will decode the
> data with LengthPrefixCoder[ByteArrayCoder] instead of the actual coder.
> That's to say, that coder does not still take effect in the SDK harness.
> Only when the coder is added in ModelCoders, it's 'known' and will take
> effect.
> >>
> >> So if we take option 3, the non-standard coders which will be wrapped
> with LengthPrefixCoder should be synced with the coders defined in the Java
> ModerCoders. (From this point of view, option 1 seems more clean!)
> >>
> >> Please correct me if I missed something. Thanks a lot!
> >>
> >> Best,
> >> Jincheng
> >>
> >> [1]
> https://github.com/apache/beam/blob/01726e9c62313749f9ea7c93063a1178abd1a8db/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/ModelCoders.java#L59
> >>
> >> Robert Burke wrote on Sat, Nov 9, 2019 at 8:46 AM:
> >>>
> >>> And by "I wasn't clear" I meant "I misread the options".
> >>>
> >>> On Fri, Nov 8, 2019, 4:14 PM Robert Burke  wrote:
> >>>>
> >>>> Reading back, I wasn't clear: the Go SDK does Option (1), putting the
> LP explicitly during encoding [1] for the runner proto, and explicitly
> expects LPs to contain a custom coder URN on decode for execution [2].
> (Modulo an old bug in Dataflow where the urn was empty)
> >>>>
> >>>>
> >>>> [1]
> https://github.com/apache/beam/blob/4364a214dfe6d8d5dd84b1bb91d579f466492ca5/sdks/go/pkg/beam/core/runtime/graphx/coder.go#L348
> >
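
A small Python sketch of why wrapping an unknown coder in a length prefix helps
(MyKeyCoder is a made-up custom coder; this is illustrative and not the actual
runner wiring):

    from apache_beam.coders import coders


    class MyKeyCoder(coders.Coder):
        # A custom coder that a runner would treat as "unknown"/non-standard.
        def encode(self, value):
            return value.encode('utf-8')

        def decode(self, encoded):
            return encoded.decode('utf-8')

        def is_deterministic(self):
            return True


    # Framed as LengthPrefixCoder[MyKeyCoder] on the wire, a runner that cannot
    # interpret MyKeyCoder can still slice out the raw key bytes, partition on
    # them, and hand the same bytes back on state requests.
    wire_coder = coders.LengthPrefixCoder(MyKeyCoder())
    key_bytes = wire_coder.encode('user-42')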

Re: 10,000 Pull Requests

2019-11-12 Thread jincheng sun
Congratulations to the Beam community! Very impressive numbers, and a very
active community!

Best,
Jincheng


Maximilian Michels wrote on Fri, Nov 8, 2019 at 1:39 AM:

> Yes! Keep the committer pipeline filled ;)
>
> Reviewing PRs probably remains one of the toughest problems in active
> open-source projects.
>
> On 07.11.19 18:28, Luke Cwik wrote:
> > We need more committers...
> > that review the code.
> >
> > On Wed, Nov 6, 2019 at 6:21 PM Pablo Estrada  > > wrote:
> >
> > iiipe : )
> >
> > On Thu, Nov 7, 2019 at 12:59 AM Kenneth Knowles  > > wrote:
> >
> > Awesome!
> >
> > Number of days from PR #1 and PR #1000: 211
> > Number of days from PR #9000 and PR #1: 71
> >
> > Kenn
> >
> > On Wed, Nov 6, 2019 at 6:28 AM Łukasz Gajowy  > > wrote:
> >
> > Yay! Nice! :)
> >
> > śr., 6 lis 2019 o 14:38 Maximilian Michels  > > napisał(a):
> >
> > Just wanted to point out, we have crossed the 10,000 PRs
> > mark :)
> >
> > ...and the winner is:
> > https://github.com/apache/beam/pull/1
> >
> > Seriously, I think Beam's culture to promote PRs over
> > direct access to
> > the repository is remarkable. To another 10,000 PRs!
> >
> > Cheers,
> > Max
> >
>


Re: [DISCUSS] Avoid redundant encoding and decoding between runner and harness

2019-11-11 Thread jincheng sun
Thank you all for your feedback.

@Kenn, @Robert, I got what you mean now. I think it's good to make the
window/timestamp/paneinfo values configurable while also avoiding redundant
encoding and decoding.

Besides, I think the idea from @Luke is good, that we can replace
WindowedValue with T in some cases to reduce the cognitive overhead.
But maybe we can solve that in a separate issue?

I will update the PR according to option 2 (Add a windowed value coder
whose window/timestamp/paneinfo values are specified as constants). Thanks
again.

Best,
Jincheng

Kenneth Knowles wrote on Sat, Nov 9, 2019 at 02:59:

>
>
> On Fri, Nov 8, 2019 at 9:23 AM Luke Cwik  wrote:
>
>>
>>
>> On Thu, Nov 7, 2019 at 7:36 PM Kenneth Knowles  wrote:
>>
>>>
>>>
>>> On Thu, Nov 7, 2019 at 9:19 AM Luke Cwik  wrote:
>>>
>>>> I did suggest one other alternative on Jincheng's PR[1] which was to
>>>> allow windowless values to be sent across the gRPC port. The SDK would then
>>>> be responsible for ensuring that the execution didn't access any properties
>>>> that required knowledge of the timestamp, pane or window. This is different
>>>> then adding the ValueOnlyWindowedValueCoder as a model coder because it
>>>> allows SDKs to pass around raw values to functions without any windowing
>>>> overhead which could be useful for things like the side input window
>>>> mapping or window merging functions we have.
>>>>
>>>
>>> When you say "pass around" what does it mean? If it is over the wire,
>>> there is already no overhead to ValueOnlyWindowedValueCoder. So do you mean
>>> the overhead of having the layer of boxing of WindowedValue? I would assume
>>> all non-value components of the WindowedValue from
>>> ValueOnlyWindowedValueCoder are pointers to a single shared immutable
>>> instance carried by the coder instance.
>>>
>>
>> I was referring to the layer of boxing of WindowedValue. My concern
>> wasn't the performance overhead of passing around a wrapper object but the
>> cognitive overhead of understanding why everything needs to be wrapped in a
>> windowed value. Since you have been working on SQL for some time, this
>> would be analogous to executing a UDF and making all the machinery around
>> it take WindowedValue instead of T.
>>
>>
>>> I think raw values can already be passed to functions, no? The main
>>> thing is that elements in a PCollection always have a window, timestamp,
>>> and paneinfo. Not all values are elements. Is there a specific case you
>>> have in mind? I would not expect WindowMappingFn or window merging fn to be
>>> passing "elements" but just values of the appropriate type for the function.
>>>
>>
>> This is about the machinery around WindowMappingFn/WindowMergingFn. For
>> example the implementation around WindowMappingFn takes a
>> WindowedValue and unwraps it forwarding it to the WindowMappingFn
>> and then takes the result and wraps it in a WindowedValue and
>> returns that to the runner.
>>
>
> I'm not familiar with this, but it sounds like it should not be necessary
> and is an implementation detail. Is there a model change necessary to avoid
> the unboxing/boxing? I would be surprised.
>
> Kenn
>
>
>>
>>
>>>
>>>
>>>> On Thu, Nov 7, 2019 at 8:48 AM Robert Bradshaw 
>>>> wrote:
>>>>
>>>>> I think there is some misunderstanding about what is meant by option
>>>>> 2. What Kenn (I think) and I are proposing is not a WindowedValueCoder
>>>>> whose window/timestamp/paneinfo coders are parameterized to be
>>>>> constant coders, but a WindowedValueCoder whose
>>>>> window/timestamp/paneinfo values are specified as constants in the
>>>>> coder.
>>>>>
>>>>> Let's call this NewValueOnlyWindowedValueCoder, and is parameterized
>>>>> by a window, timestamp, and pane info instance
>>>>>
>>>>> The existing ValueOnlyWindowedValueCoder is literally
>>>>> NewValueOnlyWindowedValueCoder(GlobalWindow, MIN_TIMESTAMP,
>>>>> PaneInfo.NO_FIRING). Note in particular that using the existing
>>>>> ValueOnlyWindowedValueCoder would give the wrong timestamp and pane
>>>>> info if it is use for the result of a GBK, which I think is the loss
>>>>> of consistency referred to here.
>>>>>
>>>>
>>> Yes, this is exactly what I am proposing and sounds like what "approach

Re: Key encodings for state requests

2019-11-08 Thread jincheng sun
> Let us first define what are "standard coders". Usually it should be the
coders defined in the Proto. However, personally I think the coders defined
in the Java ModelCoders [1] seems more appropriate. The reason is that for
a coder which has already appeared in Proto and still not added to the Java
ModelCoders, it's always replaced by the runner with
LengthPrefixCoder[ByteArrayCoder] and so the SDK harness will decode the
data with LengthPrefixCoder[ByteArrayCoder] instead of the actual coder.
That's to say, that coder does not still take effect in the SDK harness.
Only when the coder is added in ModelCoders, it's 'known' and will take
effect.

Let me correct this point: a coder which is not contained in the Java
ModelCoders is replaced with LengthPrefixCoder[ByteArrayCoder] on the runner
side and LengthPrefixCoder[CustomCoder] on the SDK harness side.

The point here is that the runner determines whether it knows the coder
according to the coders defined in the Java ModelCoders, not the coders
defined in the proto file. So if we take option 3, the non-standard coders
which will be wrapped with LengthPrefixCoder should also be determined by
the coders defined in the Java ModelCoders, not the coders defined in the
proto file.

jincheng sun wrote on Sat, Nov 9, 2019 at 12:26:

> Hi Robert Bradshaw,
>
> Thanks a lot for the explanation. Very interesting topic!
>
> Let us first define what are "standard coders". Usually it should be the
> coders defined in the Proto. However, personally I think the coders defined
> in the Java ModelCoders [1] seems more appropriate. The reason is that for
> a coder which has already appeared in Proto and still not added to the Java
> ModelCoders, it's always replaced by the runner with
> LengthPrefixCoder[ByteArrayCoder] and so the SDK harness will decode the
> data with LengthPrefixCoder[ByteArrayCoder] instead of the actual coder.
> That's to say, that coder does not still take effect in the SDK harness.
> Only when the coder is added in ModelCoders, it's 'known' and will take
> effect.
>
> So if we take option 3, the non-standard coders which will be wrapped with
> LengthPrefixCoder should be synced with the coders defined in the Java
> ModerCoders. (From this point of view, option 1 seems more clean!)
>
> Please correct me if I missed something. Thanks a lot!
>
> Best,
> Jincheng
>
> [1]
> https://github.com/apache/beam/blob/01726e9c62313749f9ea7c93063a1178abd1a8db/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/ModelCoders.java#L59
>
> Robert Burke wrote on Sat, Nov 9, 2019 at 8:46 AM:
>
>> And by "I wasn't clear" I meant "I misread the options".
>>
>> On Fri, Nov 8, 2019, 4:14 PM Robert Burke  wrote:
>>
>>> Reading back, I wasn't clear: the Go SDK does Option (1), putting the LP
>>> explicitly during encoding [1] for the runner proto, and explicitly expects
>>> LPs to contain a custom coder URN on decode for execution [2]. (Modulo an
>>> old bug in Dataflow where the urn was empty)
>>>
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/4364a214dfe6d8d5dd84b1bb91d579f466492ca5/sdks/go/pkg/beam/core/runtime/graphx/coder.go#L348
>>> [2]
>>> https://github.com/apache/beam/blob/4364a214dfe6d8d5dd84b1bb91d579f466492ca5/sdks/go/pkg/beam/core/runtime/graphx/coder.go#L219
>>>
>>>
>>> On Fri, Nov 8, 2019, 10:28 AM Robert Bradshaw 
>>> wrote:
>>>
>>>> On Fri, Nov 8, 2019 at 2:09 AM jincheng sun 
>>>> wrote:
>>>> >
>>>> > Hi,
>>>> >
>>>> > Sorry for my late reply. It seems the conclusion has been reached. I
>>>> just want to share my personal thoughts.
>>>> >
>>>> > Generally, both option 1 and 3 make sense to me.
>>>> >
>>>> > >> The key concept here is not "standard coder" but "coder that the
>>>> > >> runner does not understand." This knowledge is only in the runner.
>>>> > >> Also has the downside of (2).
>>>> >
>>>> > >Yes, I had assumed "non-standard" and "unknown" are the same, but the
>>>> > >latter can be a subset of the former, i.e. if a Runner does not
>>>> support
>>>> > >all of the standard coders for some reason.
>>>> >
>>>> > I'm also assume that "non-standard" and "unknown" are the same.
>>>> Currently, in the runner side[1] it
>>>> > decides whether the coder is unknown(wrap with length prefix coder)
>>>> according to whether the coder is among
>>>> > the 

Re: Key encodings for state requests

2019-11-08 Thread jincheng sun
Hi Robert Bradshaw,

Thanks a lot for the explanation. Very interesting topic!

Let us first define what "standard coders" are. Usually they should be the
coders defined in the proto. However, personally I think the coders defined
in the Java ModelCoders [1] seem more appropriate. The reason is that for
a coder which already appears in the proto but has not been added to the Java
ModelCoders, it's always replaced by the runner with
LengthPrefixCoder[ByteArrayCoder], and so the SDK harness will decode the
data with LengthPrefixCoder[ByteArrayCoder] instead of the actual coder.
That is to say, that coder still does not take effect in the SDK harness.
Only when the coder is added to ModelCoders is it 'known' and does it take
effect.

So if we take option 3, the non-standard coders which will be wrapped with
LengthPrefixCoder should be synced with the coders defined in the Java
ModelCoders. (From this point of view, option 1 seems cleaner!)

Please correct me if I missed something. Thanks a lot!

Best,
Jincheng

[1]
https://github.com/apache/beam/blob/01726e9c62313749f9ea7c93063a1178abd1a8db/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/ModelCoders.java#L59

Robert Burke wrote on Sat, Nov 9, 2019 at 8:46 AM:

> And by "I wasn't clear" I meant "I misread the options".
>
> On Fri, Nov 8, 2019, 4:14 PM Robert Burke  wrote:
>
>> Reading back, I wasn't clear: the Go SDK does Option (1), putting the LP
>> explicitly during encoding [1] for the runner proto, and explicitly expects
>> LPs to contain a custom coder URN on decode for execution [2]. (Modulo an
>> old bug in Dataflow where the urn was empty)
>>
>>
>> [1]
>> https://github.com/apache/beam/blob/4364a214dfe6d8d5dd84b1bb91d579f466492ca5/sdks/go/pkg/beam/core/runtime/graphx/coder.go#L348
>> [2]
>> https://github.com/apache/beam/blob/4364a214dfe6d8d5dd84b1bb91d579f466492ca5/sdks/go/pkg/beam/core/runtime/graphx/coder.go#L219
>>
>>
>> On Fri, Nov 8, 2019, 10:28 AM Robert Bradshaw 
>> wrote:
>>
>>> On Fri, Nov 8, 2019 at 2:09 AM jincheng sun 
>>> wrote:
>>> >
>>> > Hi,
>>> >
>>> > Sorry for my late reply. It seems the conclusion has been reached. I
>>> just want to share my personal thoughts.
>>> >
>>> > Generally, both option 1 and 3 make sense to me.
>>> >
>>> > >> The key concept here is not "standard coder" but "coder that the
>>> > >> runner does not understand." This knowledge is only in the runner.
>>> > >> Also has the downside of (2).
>>> >
>>> > >Yes, I had assumed "non-standard" and "unknown" are the same, but the
>>> > >latter can be a subset of the former, i.e. if a Runner does not
>>> support
>>> > >all of the standard coders for some reason.
>>> >
>>> > I'm also assume that "non-standard" and "unknown" are the same.
>>> Currently, in the runner side[1] it
>>> > decides whether the coder is unknown(wrap with length prefix coder)
>>> according to whether the coder is among
>>> > the standard coders. It will not communicate with harness to make this
>>> decision.
>>> >
>>> > So, from my point of view, we can update the PR according to option 1
>>> or 3.
>>> >
>>> > [1]
>>> https://github.com/apache/beam/blob/66a67e6580b93906038b31ae7070204cec90999c/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/wire/LengthPrefixUnknownCoders.java#L62
>>>
>>> That list is populated in Java code [1] and has typically been a
>>> subset of what is in the proto file. Things like StringUtf8Coder and
>>> DoubleCoder have been added at different times to different SDKs and
>>> Runners, sometimes long after the URN is in the proto. Having to keep
>>> this list synchronized (and versioned) would be a regression.
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/release-2.17.0/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/ModelCoders.java
>>>
>>> The PR taking approach (1) looks good at a first glance (I see others
>>> are reviewing it). Thanks.
>>>
>>> > Maximilian Michels wrote on Fri, Nov 8, 2019 at 3:35 AM:
>>> >>
>>> >> > While the Go SDK doesn't yet support a State API, Option 3) is what
>>> the Go SDK does for all non-standard coders (aka custom coders) anyway.
>>> >>
>>> >> For wire transfer, the Java Runner also adds a LengthPrefixCo

Re: [Discuss] Beam Summit 2020 Dates & locations

2019-11-08 Thread jincheng sun
+1 for extending the discussion to the user mailing list.

Maximilian Michels wrote on Fri, Nov 8, 2019 at 6:32 PM:

> The dates sounds good to me. I agree that the bay area has an advantage
> because of its large tech community. On the other hand, it is a question
> of how we run the event. For Berlin we managed to get about 200
> attendees to Berlin, but for the BeamSummit in Las Vegas with ApacheCon
> the attendance was much lower.
>
> Should this also be discussed on the user mailing list?
>
> Cheers,
> Max
>
> On 07.11.19 22:50, Alex Van Boxel wrote:
> > For date wise, I'm wondering why we should switching the Europe and NA
> > one, this would mean that the Berlin and the new EU summit would be
> > almost 1.5 years apart.
> >
> >   _/
> > _/ Alex Van Boxel
> >
> >
> > On Thu, Nov 7, 2019 at 8:43 PM Ahmet Altay  > > wrote:
> >
> > I prefer bay are for NA summit. My reasoning is that there is a
> > criticall mass of contributors and users in that location, probably
> > more than alternative NA locations. I was not involved with planning
> > recently and I do not know if there were people who could attend due
> > to location previously. If that is the case, I agree with Elliotte
> > on looking for other options.
> >
> > Related to dates: March (Asia) and mid-May (NA) dates are a bit
> > close. Mid-June for NA might be better to spread events. Other
> > pieces looks good.
> >
> > Ahmet
> >
> > On Thu, Nov 7, 2019 at 7:09 AM Elliotte Rusty Harold
> > mailto:elh...@ibiblio.org>> wrote:
> >
> > The U.S. sadly is not a reliable destination for international
> > conferences these days. Almost every conference I go to, big and
> > small, has at least one speaker, sometimes more, who can't get
> into
> > the country. Canada seems worth considering. Vancouver,
> > Montreal, and
> > Toronto are all convenient.
> >
> > On Wed, Nov 6, 2019 at 2:17 PM Griselda Cuevas  > > wrote:
> >  >
> >  > Hi Beam Community!
> >  >
> >  > I'd like to kick off a thread to discuss potential dates and
> > venues for the 2020 Beam Summits.
> >  >
> >  > I did some research on industry conferences happening in 2020
> > and pre-selected a few ranges as follows:
> >  >
> >  > (2 days) NA between mid-May and mid-June
> >  > (2 days) EU mid October
> >  > (1 day) Asia Mini Summit:  March
> >  >
> >  > I'd like to hear your thoughts on these dates and get
> > consensus on exact dates as the convo progresses.
> >  >
> >  > For locations these are the options I reviewed:
> >  >
> >  > NA: Austin Texas, Berkeley California, Mexico City.
> >  > Europe: Warsaw, Barcelona, Paris
> >  > Asia: Singapore
> >  >
> >  > Let the discussion begin!
> >  > G (on behalf of the Beam Summit Steering Committee)
> >  >
> >  >
> >  >
> >
> >
> > --
> > Elliotte Rusty Harold
> > elh...@ibiblio.org 
> >
>


Re: Key encodings for state requests

2019-11-08 Thread jincheng sun
Hi,

Sorry for my late reply. It seems the conclusion has been reached. I just
want to share my personal thoughts.

Generally, both option 1 and 3 make sense to me.

>> The key concept here is not "standard coder" but "coder that the
>> runner does not understand." This knowledge is only in the runner.
>> Also has the downside of (2).

>Yes, I had assumed "non-standard" and "unknown" are the same, but the
>latter can be a subset of the former, i.e. if a Runner does not support
>all of the standard coders for some reason.

I also assume that "non-standard" and "unknown" are the same. Currently,
on the runner side [1] it
decides whether a coder is unknown (and wraps it with a length prefix coder)
according to whether the coder is among
the standard coders. It does not communicate with the harness to make this
decision.

So, from my point of view, we can update the PR according to option 1 or 3.

[1]
https://github.com/apache/beam/blob/66a67e6580b93906038b31ae7070204cec90999c/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/wire/LengthPrefixUnknownCoders.java#L62

Maximilian Michels wrote on Fri, Nov 8, 2019 at 3:35 AM:

> > While the Go SDK doesn't yet support a State API, Option 3) is what the
> Go SDK does for all non-standard coders (aka custom coders) anyway.
>
> For wire transfer, the Java Runner also adds a LengthPrefixCoder for the
> coder and its subcomponents. The problem is that this is an implicit
> assumption made. In the Proto, we do not have this represented. This is
> why **for state requests**, we end up with a
> "LengthPrefixCoder[CustomCoder]" on the Runner and a "CustomCoder" on
> the SDK Harness side. Note that the Python Harness does wrap unknown
> coders in a LengthPrefixCoder for transferring regular elements, but the
> LengthPrefixCoder is not preserved for the state requests.
>
> In that sense (3) is good because it follows this implicit notion of
> adding a LengthPrefixCoder for wire transfer, but applies it to state
> requests.
>
> However, option (1) is most reliable because the LengthPrefixCoder is
> actually in the Proto. So "CustomCoder" will always be represented as
> "LengthPrefixCoder[CustomCoder]", and only standard coders will be added
> without a LengthPrefixCoder.
>
> > I'd really like to avoid implicit agreements about how the coder that
> > should be used differs from what's specified in the proto in different
> > contexts.
>
> Option (2) would work on top of the existing logic because replacing a
> non-standard coder with a "NOOP coder" would just be used by the Runner
> to produce a serialized version of the key for partitioning. Flink
> always operates on the serialized key, be it standard or non-standard
> coder. It wouldn't be necessary to change any of the existing wire
> transfer logic or representation. I understand that it would be less
> ideal, but maybe easier to fix for the release.
>
> > The key concept here is not "standard coder" but "coder that the
> > runner does not understand." This knowledge is only in the runner.
> > Also has the downside of (2).
>
> Yes, I had assumed "non-standard" and "unknown" are the same, but the
> latter can be a subset of the former, i.e. if a Runner does not support
> all of the standard coders for some reason.
>
> > This means that the wire format that the runner sends for the "key"
> represents the exact same wire format it will receive for state requests.
>
> The wire format for the entire element is the same. Otherwise we
> wouldn't be able to process data between the Runner and the SDK Harness.
> However, the problem is that the way the Runner instantiates the key
> coder to partition elements, does not match how the SDK encodes the key
> when it sends a state request to the Runner. Conceptually, those two
> situations should be the same, but in practice they are not.
>
>
> Now that I thought about it again option (1) is probably the most
> explicit and in that sense cleanest. However, option (3) is kind of fair
> because it would just replicate the implicit LengthPrefixCoder behavior
> we have for general wire transfer also for state requests. Option (2) I
> suppose is the most implicit and runner-specific, should probably be
> avoided in the long run.
>
> So I'd probably opt for (1) and I would update the PR[1] rather soon
> because this currently blocks the release, as this is a regression from
> 2.16.0.[2]
>
>
> -Max
>
> [1] https://github.com/apache/beam/pull/9997
> [2] (In 2.16.0 it worked for Python because the Runner used a
> ByteArrayCoder with the OUTER encoding context for the key which was
> basically option (2). Only problem that, for standard coders the Java
> SDK Harness produced non-matching state request keys, due to it using
> the NESTED context.)
>
> On 07.11.19 18:01, Luke Cwik wrote:
> >
> >
> > On Thu, Nov 7, 2019 at 8:22 AM Robert Bradshaw  > > wrote:
> >
> > On Thu, Nov 7, 2019 at 6:26 AM Maximilian Michels  > > wrote:
> >  >
> 

Re: [DISCUSS] Avoid redundant encoding and decoding between runner and harness

2019-11-07 Thread jincheng sun
Thanks for your feedback and the valuable comments, Kenn & Robert!

I think your comments are more comprehensive and enlightened me a lot. The
two proposals I mentioned above are to reuse the existing coders
(FullWindowedValueCoder and ValueOnlyWindowedValueCoder). Now, with your
comments, I think we can further abstract 'FullWindowedValueCoder' and
'ValueOnlyWindowedValueCoder'; that is, we can rename
'FullWindowedValueCoder' to 'WindowedValueCoder' and make the coders of
window/timestamp/pane configurable. Then we can remove
'ValueOnlyWindowedValueCoder', and we would need to add a number of constant
coders for window/timestamp/pane.

I have replied to your comments on the doc; quick feedback as follows:

Regarding "Approach 2: probably no SDK harness work / compatible with
existing Beam model so no risk of introducing inconsistency": if we "just
put default window/timestamp/pane info on elements" and don't change the
original coder, then the performance is not optimized. If we want to get
the best performance, then the default coders for window/timestamp/pane
should be constant coders. In this case the SDK harnesses need to be aware
of the constant coders and there will be some development work in the SDK
harness. Besides, the SDK harness also needs to make the coders for
window/timestamp/pane configurable, and this will introduce some related
changes, such as updating WindowedValueCoder._get_component_coders, etc.

Regarding "Approach 1: option a: if the SDK harness has to understand
'values without windows' then very large changes and high risk of
introducing inconsistency (I eliminated many of these inconsistencies)": we
only need to add ValueOnlyWindowedValueCoder to the StandardCoders, and all
the SDK harnesses should be aware of this coder. There are not many changes
actually.

Please feel free to correct me if there is anything incorrect. :)

Besides, I'm not quite clear about the consistency issues you meant here.
Could you please give me some hints about this?

Best,
Jincheng

Robert Bradshaw wrote on Thu, Nov 7, 2019 at 3:38 AM:

> Yes, the portability framework is designed to support this, and
> possibly even more efficient transfers of data than element-by-element
> as per the wire coder specified in the IO port operators. I left some
> comments on the doc as well, and would also prefer approach 2.
>
> On Wed, Nov 6, 2019 at 11:03 AM Kenneth Knowles  wrote:
> >
> > I think the portability framework is designed for this. The runner
> controls the coder on the grpc ports and the runner controls the process
> bundle descriptor.
> >
> > I commented on the doc. I think what is missing is analysis of scope of
> SDK harness changes and risk to model consistency
> >
> > Approach 2: probably no SDK harness work / compatible with existing
> Beam model so no risk of introducing inconsistency
> >
> > Approach 1: what are all the details?
> > option a: if the SDK harness has to understand "values without
> windows" then very large changes and high risk of introducing inconsistency
> (I eliminated many of these inconsistencies)
> > option b: if the coder just puts default window/timestamp/pane
> info on elements, then it is the same as approach 2, no work / no risk
> >
> > Kenn
> >
> > On Wed, Nov 6, 2019 at 1:09 AM jincheng sun 
> wrote:
> >>
> >> Hi all,
> >>
> >> I am trying to make some improvements of portability framework to make
> it usable in other projects. However, we find that the coder between runner
> and harness can only be FullWindowedValueCoder. This means each time when
> sending a WindowedValue, we have to encode/decode timestamp, windows and
> pan infos. In some circumstances(such as using the portability framework in
> Flink), only values are needed between runner and harness. So, it would be
> nice if we can configure the coder and avoid redundant encoding and
> decoding between runner and harness to improve the performance.
> >>
> >> There are two approaches to solve this issue:
> >>
> >> Approach 1:  Support ValueOnlyWindowedValueCoder between runner and
> harness.
> >> Approach 2:  Add a "constant" window coder that embeds all the
> windowing information as part of the coder that should be used to wrap the
> value during decoding.
> >>
> >> More details can be found here [1].
> >>
> >> As of the shortcomings of “Approach 2” which still need to
> encode/decode timestamp and pane infos, we tend to choose “Approach 1”
> which brings better performance and is more thorough.
> >>
> >> Welcome any feedback :)
> >>
> >> Best,
> >> Jincheng
> >>
> >> [1]
> https://docs.google.com/document/d/1TTKZC6ppVozG5zV5RiRKXse6qnJl-EsHGb_LkUfoLxY/edit?usp=sharing
> >>
>


[DISCUSS] Avoid redundant encoding and decoding between runner and harness

2019-11-06 Thread jincheng sun
Hi all,

I am trying to make some improvements to the portability framework to make
it usable in other projects. However, we found that the coder between the
runner and the harness can only be FullWindowedValueCoder. This means that
each time a WindowedValue is sent, we have to encode/decode the timestamp,
windows and pane info. In some circumstances (such as using the portability
framework in Flink), only the values are needed between the runner and the
harness. So, it would be nice if we could configure the coder and avoid
redundant encoding and decoding between runner and harness to improve
performance.

There are two approaches to solve this issue:

Approach 1:  Support ValueOnlyWindowedValueCoder between runner and
harness.
Approach 2:  Add a "constant" window coder that embeds all the
windowing information as part of the coder that should be used to wrap the
value during decoding.

More details can be found here [1].

Because of the shortcomings of “Approach 2”, which still needs to
encode/decode the timestamp and pane info, we tend to choose “Approach 1”,
which brings better performance and is more thorough.

Welcome any feedback :)

Best,
Jincheng

[1]
https://docs.google.com/document/d/1TTKZC6ppVozG5zV5RiRKXse6qnJl-EsHGb_LkUfoLxY/edit?usp=sharing


Re: [DISCUSS] How to stop SdkWorker in SdkHarness

2019-10-28 Thread jincheng sun
Sure, thank you for your confirmation, Luke! :)

Best,
Jincheng

Luke Cwik  于2019年10月29日周二 上午1:20写道:

> I would go with creating JIRAs and PRs directly since this doesn't seem to
> be contentious since you have received feedback from a few folks and they
> are all suggesting the same thing.
>
> On Sun, Oct 27, 2019 at 9:27 PM jincheng sun 
> wrote:
>
>> Hi all,
>>
>> Thanks a lot for your feedback. It seems that we have reached consensus
>> that both "Approach 2" and "Approach 3" are needed. "Approach 3" is a good
>> supplement for "Approach 2" and I also prefer "Approach 2" and "Approach 3"
>> for now.
>>
>> Do we need to vote on this discussion or I can create JIRAs and submit
>> the PRs directly?
>>
>> Best,
>> Jincheng
>>
>> Luke Cwik  于2019年10月26日周六 上午4:01写道:
>>
>>> Approach 3 is about caching the bundle descriptor forever but tearing
>>> down a "live" instance of the DoFns at some SDK chosen arbitrary point in
>>> time. This way if a future ProcessBundleRequest comes in, a new "live"
>>> instance can be constructed.
>>> Approach 2 is still needed so that when the workers are being
>>> shutdown all the "live" instances are torn down.
>>>
>>> On Fri, Oct 25, 2019 at 12:56 PM Robert Burke 
>>> wrote:
>>>
>>>> Approach 2 isn't incompatible with approach 3. 3 simple sets down
>>>> convention/configuration for the conditions when the SDK will do this after
>>>> process bundle has completed.
>>>>
>>>>
>>>>
>>>> On Fri, Oct 25, 2019, 12:34 PM Robert Bradshaw 
>>>> wrote:
>>>>
>>>>> I think we'll still need approach (2) for when the pipeline finishes
>>>>> and a runner is tearing down workers.
>>>>>
>>>>> On Fri, Oct 25, 2019 at 10:36 AM Maximilian Michels 
>>>>> wrote:
>>>>> >
>>>>> > Hi Jincheng,
>>>>> >
>>>>> > Thanks for bringing this up and capturing the ideas in the doc.
>>>>> >
>>>>> > Intuitively, I would have also considered adding a new Proto message
>>>>> for
>>>>> > the teardown, but I think the idea to trigger this logic when the SDK
>>>>> > Harness evicts process bundle descriptors is more elegant.
>>>>> >
>>>>> > Thanks,
>>>>> > Max
>>>>> >
>>>>> > On 25.10.19 17:23, Luke Cwik wrote:
>>>>> > > I like approach 3 since it doesn't add additional complexity to
>>>>> the API
>>>>> > > and individual SDKs can choose to implement any clean-up strategy
>>>>> they
>>>>> > > want or none at all which is the simplest.
>>>>> > >
>>>>> > > On Thu, Oct 24, 2019 at 8:46 PM jincheng sun <
>>>>> sunjincheng...@gmail.com
>>>>> > > <mailto:sunjincheng...@gmail.com>> wrote:
>>>>> > >
>>>>> > > Hi,
>>>>> > >
>>>>> > > Thanks for your comments in doc, I have add Approach 3 which
>>>>> you
>>>>> > > mentioned! @Luke
>>>>> > >
>>>>> > > For now, we should do a decision for Approach 3 and Approach 1.
>>>>> > > Detail can be found in doc [1]
>>>>> > >
>>>>> > > Welcome anyone's feedback :)
>>>>> > >
>>>>> > > Regards,
>>>>> > > Jincheng
>>>>> > >
>>>>> > > [1]
>>>>> > >
>>>>> https://docs.google.com/document/d/1sCgy9VQPf9zVXKRquK8P6N4x7aB62GEO8ozkujRSHZg/edit?usp=sharing
>>>>> > >
>>>>> > > jincheng sun >>>> > > <mailto:sunjincheng...@gmail.com>> 于2019年10月25日周五 上午10:40写道:
>>>>> > >
>>>>> > > Hi,
>>>>> > >
>>>>> > > Functionally capable of `abort`, but it will be called at
>>>>> the
>>>>> > > end of operator. So, I prefer `dispose` semantics. i.e.,
>>>>> all
>>>>> > > normal logic has been executed.
>>>>> > >

Re: [DISCUSS] How to stop SdkWorker in SdkHarness

2019-10-27 Thread jincheng sun
Hi all,

Thanks a lot for your feedback. It seems that we have reached consensus
that both "Approach 2" and "Approach 3" are needed. "Approach 3" is a good
supplement to "Approach 2", and I also prefer "Approach 2" plus "Approach 3"
for now.
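
For reference, here is a rough sketch of how the two mechanisms could fit
together on the SDK harness side (the class and method names are made up
for illustration; this is not the actual sdk_worker code):

import time

class ProcessorCacheSketch(object):
  """Keeps bundle processors alive across bundles (hypothetical sketch)."""

  def __init__(self, ttl_seconds=60):
    self._ttl = ttl_seconds
    self._processors = {}  # descriptor id -> (processor, last_used)

  def get(self, descriptor_id, create_fn):
    processor, _ = self._processors.get(descriptor_id, (None, None))
    if processor is None:
      processor = create_fn()
    self._processors[descriptor_id] = (processor, time.time())
    return processor

  def evict_idle(self):
    # "Approach 3": tear down processors that have been idle for too long;
    # a new one can be created if another ProcessBundleRequest arrives later.
    now = time.time()
    for descriptor_id, (processor, last_used) in list(self._processors.items()):
      if now - last_used > self._ttl:
        processor.shutdown()  # this is where DoFn.teardown() would run
        del self._processors[descriptor_id]

  def shutdown_all(self):
    # "Approach 2": invoked when the control service terminates, so that
    # every remaining DoFn is torn down before the worker exits.
    for processor, _ in self._processors.values():
      processor.shutdown()
    self._processors.clear()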

Do we need to vote on this discussion, or can I create JIRAs and submit the
PRs directly?

Best,
Jincheng

Luke Cwik  于2019年10月26日周六 上午4:01写道:

> Approach 3 is about caching the bundle descriptor forever but tearing down
> a "live" instance of the DoFns at some SDK chosen arbitrary point in time.
> This way if a future ProcessBundleRequest comes in, a new "live" instance
> can be constructed.
> Approach 2 is still needed so that when the workers are being shutdown all
> the "live" instances are torn down.
>
> On Fri, Oct 25, 2019 at 12:56 PM Robert Burke  wrote:
>
>> Approach 2 isn't incompatible with approach 3. 3 simple sets down
>> convention/configuration for the conditions when the SDK will do this after
>> process bundle has completed.
>>
>>
>>
>> On Fri, Oct 25, 2019, 12:34 PM Robert Bradshaw 
>> wrote:
>>
>>> I think we'll still need approach (2) for when the pipeline finishes
>>> and a runner is tearing down workers.
>>>
>>> On Fri, Oct 25, 2019 at 10:36 AM Maximilian Michels 
>>> wrote:
>>> >
>>> > Hi Jincheng,
>>> >
>>> > Thanks for bringing this up and capturing the ideas in the doc.
>>> >
>>> > Intuitively, I would have also considered adding a new Proto message
>>> for
>>> > the teardown, but I think the idea to trigger this logic when the SDK
>>> > Harness evicts process bundle descriptors is more elegant.
>>> >
>>> > Thanks,
>>> > Max
>>> >
>>> > On 25.10.19 17:23, Luke Cwik wrote:
>>> > > I like approach 3 since it doesn't add additional complexity to the
>>> API
>>> > > and individual SDKs can choose to implement any clean-up strategy
>>> they
>>> > > want or none at all which is the simplest.
>>> > >
>>> > > On Thu, Oct 24, 2019 at 8:46 PM jincheng sun <
>>> sunjincheng...@gmail.com
>>> > > <mailto:sunjincheng...@gmail.com>> wrote:
>>> > >
>>> > > Hi,
>>> > >
>>> > > Thanks for your comments in doc, I have add Approach 3 which you
>>> > > mentioned! @Luke
>>> > >
>>> > > For now, we should do a decision for Approach 3 and Approach 1.
>>> > > Detail can be found in doc [1]
>>> > >
>>> > > Welcome anyone's feedback :)
>>> > >
>>> > > Regards,
>>> > > Jincheng
>>> > >
>>> > > [1]
>>> > >
>>> https://docs.google.com/document/d/1sCgy9VQPf9zVXKRquK8P6N4x7aB62GEO8ozkujRSHZg/edit?usp=sharing
>>> > >
>>> > > jincheng sun >> > > <mailto:sunjincheng...@gmail.com>> 于2019年10月25日周五 上午10:40写道:
>>> > >
>>> > > Hi,
>>> > >
>>> > > Functionally capable of `abort`, but it will be called at the
>>> > > end of operator. So, I prefer `dispose` semantics. i.e., all
>>> > > normal logic has been executed.
>>> > >
>>> > > Best,
>>> > > Jincheng
>>> > >
>>> > > Harsh Vardhan mailto:anan...@google.com
>>> >>
>>> > > 于2019年10月23日周三 上午12:14写道:
>>> > >
>>> > > Would approach 1 be akin to abort semantics?
>>> > >
>>> > > On Mon, Oct 21, 2019 at 8:01 PM jincheng sun
>>> > > >> sunjincheng...@gmail.com>>
>>> > > wrote:
>>> > >
>>> > > Hi Luke,
>>> > >
>>> > > Thanks a lot for your reply. Since it allows to share
>>> > > one SDK harness between multiple executable stages,
>>> the
>>> > > control service termination may occur much later than
>>> > > the completion of an executable stage. This is the
>>> main
>>> > > reason I prefer runners to control the teardown of

Re: [DISCUSS] How to stop SdkWorker in SdkHarness

2019-10-24 Thread jincheng sun
Hi,

Thanks for your comments in the doc. I have added Approach 3, which you
mentioned! @Luke

For now, we should make a decision between Approach 3 and Approach 1.
Details can be found in the doc [1].

Welcome anyone's feedback :)

Regards,
Jincheng

[1]
https://docs.google.com/document/d/1sCgy9VQPf9zVXKRquK8P6N4x7aB62GEO8ozkujRSHZg/edit?usp=sharing

jincheng sun  于2019年10月25日周五 上午10:40写道:

> Hi,
>
> Functionally capable of `abort`, but it will be called at the end of
> operator. So, I prefer `dispose` semantics. i.e., all normal logic has been
> executed.
>
> Best,
> Jincheng
>
> Harsh Vardhan  于2019年10月23日周三 上午12:14写道:
>
>> Would approach 1 be akin to abort semantics?
>>
>> On Mon, Oct 21, 2019 at 8:01 PM jincheng sun 
>> wrote:
>>
>>> Hi Luke,
>>>
>>> Thanks a lot for your reply. Since it allows to share one SDK harness
>>> between multiple executable stages, the control service termination may
>>> occur much later than the completion of an executable stage. This is the
>>> main reason I prefer runners to control the teardown of DoFns.
>>>
>>> Regarding to "SDK harnesses can terminate instances any time they want
>>> and start new instances anytime as well.", personally I think it's not
>>> conflict with the proposed Approach 1 as the SDK harness could decide what
>>> to do when receiving the teardown request. It could do nothing if the DoFns
>>> has already been teared down and could also tear down the DoFns if needed.
>>>
>>> What do you think?
>>>
>>> Best,
>>> Jincheng
>>>
>>> Luke Cwik  于2019年10月22日周二 上午2:05写道:
>>>
>>>> Approach 2 is currently the suggested approach[1] for DoFn's to
>>>> shutdown.
>>>> Note that SDK harnesses can terminate instances any time they want and
>>>> start new instances anytime as well.
>>>>
>>>> Why do you want to expose this logic so that Runners could control it?
>>>>
>>>> 1:
>>>> https://docs.google.com/document/d/1n6s3BOxOPct3uF4UgbbI9O9rpdiKWFH9R6mtVmR7xp0/edit#
>>>>
>>>> On Mon, Oct 21, 2019 at 4:27 AM jincheng sun 
>>>> wrote:
>>>>
>>>>> Hi,
>>>>> I found that in `SdkHarness` do not  stop the `SdkWorker` when
>>>>> finish.  We should add the logic for stop the `SdkWorker` in `SdkHarness`.
>>>>> More detail can be found [1].
>>>>>
>>>>> There are two approaches to solve this issue:
>>>>>
>>>>> Approach 1:  We can add a Fn API for teardown purpose and the runner
>>>>> will teardown a specific bundle descriptor via this teardown Fn API during
>>>>> disposing.
>>>>> Approach 2: The control service termination could be seen as a signal
>>>>> and once SDK harness receives this signal, the teardown of the bundle
>>>>> descriptor will be performed.
>>>>>
>>>>> More detail can be found in [2].
>>>>>
>>>>> As the Approach 2, SDK harness could be shared between multiple
>>>>> executable stages. The control service termination only occurs when all 
>>>>> the
>>>>> executable stages sharing the same SDK harness finished. This means that
>>>>> the teardown of DoFns may not be executed immediately after an executable
>>>>> stage is finished.
>>>>>
>>>>> So, I prefer Approach 1. Welcome any feedback :)
>>>>>
>>>>> Best,
>>>>> Jincheng
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/sdk_worker.py
>>>>> [2]
>>>>> https://docs.google.com/document/d/1sCgy9VQPf9zVXKRquK8P6N4x7aB62GEO8ozkujRSHZg/edit?usp=sharing
>>>>>
>>>> --
>>
>> Got feedback? go/harsh-feedback <https://goto.google.com/harsh-feedback>
>>
>


Re: [DISCUSS] How to stop SdkWorker in SdkHarness

2019-10-24 Thread jincheng sun
Hi,

It is functionally similar to `abort`, but it will be called at the end of
the operator. So, I prefer `dispose` semantics, i.e., all normal logic has
been executed.

Best,
Jincheng

Harsh Vardhan  于2019年10月23日周三 上午12:14写道:

> Would approach 1 be akin to abort semantics?
>
> On Mon, Oct 21, 2019 at 8:01 PM jincheng sun 
> wrote:
>
>> Hi Luke,
>>
>> Thanks a lot for your reply. Since it allows to share one SDK harness
>> between multiple executable stages, the control service termination may
>> occur much later than the completion of an executable stage. This is the
>> main reason I prefer runners to control the teardown of DoFns.
>>
>> Regarding to "SDK harnesses can terminate instances any time they want
>> and start new instances anytime as well.", personally I think it's not
>> conflict with the proposed Approach 1 as the SDK harness could decide what
>> to do when receiving the teardown request. It could do nothing if the DoFns
>> has already been teared down and could also tear down the DoFns if needed.
>>
>> What do you think?
>>
>> Best,
>> Jincheng
>>
>> Luke Cwik  于2019年10月22日周二 上午2:05写道:
>>
>>> Approach 2 is currently the suggested approach[1] for DoFn's to shutdown.
>>> Note that SDK harnesses can terminate instances any time they want and
>>> start new instances anytime as well.
>>>
>>> Why do you want to expose this logic so that Runners could control it?
>>>
>>> 1:
>>> https://docs.google.com/document/d/1n6s3BOxOPct3uF4UgbbI9O9rpdiKWFH9R6mtVmR7xp0/edit#
>>>
>>> On Mon, Oct 21, 2019 at 4:27 AM jincheng sun 
>>> wrote:
>>>
>>>> Hi,
>>>> I found that in `SdkHarness` do not  stop the `SdkWorker` when finish.
>>>> We should add the logic for stop the `SdkWorker` in `SdkHarness`.  More
>>>> detail can be found [1].
>>>>
>>>> There are two approaches to solve this issue:
>>>>
>>>> Approach 1:  We can add a Fn API for teardown purpose and the runner
>>>> will teardown a specific bundle descriptor via this teardown Fn API during
>>>> disposing.
>>>> Approach 2: The control service termination could be seen as a signal
>>>> and once SDK harness receives this signal, the teardown of the bundle
>>>> descriptor will be performed.
>>>>
>>>> More detail can be found in [2].
>>>>
>>>> As the Approach 2, SDK harness could be shared between multiple
>>>> executable stages. The control service termination only occurs when all the
>>>> executable stages sharing the same SDK harness finished. This means that
>>>> the teardown of DoFns may not be executed immediately after an executable
>>>> stage is finished.
>>>>
>>>> So, I prefer Approach 1. Welcome any feedback :)
>>>>
>>>> Best,
>>>> Jincheng
>>>>
>>>> [1]
>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/sdk_worker.py
>>>> [2]
>>>> https://docs.google.com/document/d/1sCgy9VQPf9zVXKRquK8P6N4x7aB62GEO8ozkujRSHZg/edit?usp=sharing
>>>>
>>> --
>
> Got feedback? go/harsh-feedback <https://goto.google.com/harsh-feedback>
>


Re: [DISCUSS] How to stop SdkWorker in SdkHarness

2019-10-21 Thread jincheng sun
Hi Luke,

Thanks a lot for your reply. Since one SDK harness can be shared between
multiple executable stages, the control service termination may occur much
later than the completion of an executable stage. This is the main reason I
prefer the runner to control the teardown of DoFns.

Regarding "SDK harnesses can terminate instances any time they want and
start new instances anytime as well.": personally I think it does not
conflict with the proposed Approach 1, as the SDK harness could decide what
to do when receiving the teardown request. It could do nothing if the DoFns
have already been torn down, or tear down the DoFns if needed.

What do you think?

Best,
Jincheng

Luke Cwik  于2019年10月22日周二 上午2:05写道:

> Approach 2 is currently the suggested approach[1] for DoFn's to shutdown.
> Note that SDK harnesses can terminate instances any time they want and
> start new instances anytime as well.
>
> Why do you want to expose this logic so that Runners could control it?
>
> 1:
> https://docs.google.com/document/d/1n6s3BOxOPct3uF4UgbbI9O9rpdiKWFH9R6mtVmR7xp0/edit#
>
> On Mon, Oct 21, 2019 at 4:27 AM jincheng sun 
> wrote:
>
>> Hi,
>> I found that in `SdkHarness` do not  stop the `SdkWorker` when finish.
>> We should add the logic for stop the `SdkWorker` in `SdkHarness`.  More
>> detail can be found [1].
>>
>> There are two approaches to solve this issue:
>>
>> Approach 1:  We can add a Fn API for teardown purpose and the runner will
>> teardown a specific bundle descriptor via this teardown Fn API during
>> disposing.
>> Approach 2: The control service termination could be seen as a signal and
>> once SDK harness receives this signal, the teardown of the bundle
>> descriptor will be performed.
>>
>> More detail can be found in [2].
>>
>> As the Approach 2, SDK harness could be shared between multiple
>> executable stages. The control service termination only occurs when all the
>> executable stages sharing the same SDK harness finished. This means that
>> the teardown of DoFns may not be executed immediately after an executable
>> stage is finished.
>>
>> So, I prefer Approach 1. Welcome any feedback :)
>>
>> Best,
>> Jincheng
>>
>> [1]
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/sdk_worker.py
>> [2]
>> https://docs.google.com/document/d/1sCgy9VQPf9zVXKRquK8P6N4x7aB62GEO8ozkujRSHZg/edit?usp=sharing
>>
>


[DISCUSS] How to stop SdkWorker in SdkHarness

2019-10-21 Thread jincheng sun
Hi,
I found that `SdkHarness` does not stop the `SdkWorker` when it finishes.
We should add logic to stop the `SdkWorker` in `SdkHarness`. More detail can
be found in [1].

There are two approaches to solve this issue:

Approach 1:  We can add a Fn API for teardown purpose and the runner will
teardown a specific bundle descriptor via this teardown Fn API during
disposing.
Approach 2: The control service termination could be seen as a signal, and
once the SDK harness receives this signal, the teardown of the bundle
descriptor will be performed.
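
To illustrate what Approach 1 could look like on the harness side, here is a
rough sketch (the request type and field names are made up for illustration;
no such message exists in the Fn API today):

class SdkWorkerSketch(object):
  """Sketch of handling a hypothetical 'teardown' instruction."""

  def __init__(self):
    self.bundle_processors = {}  # bundle descriptor id -> processor

  def do_instruction(self, request):
    # register / process_bundle / ... requests are dispatched here today;
    # the teardown request below is the hypothetical addition.
    if request.WhichOneof('request') == 'teardown':
      return self.teardown(request.teardown)
    raise NotImplementedError(request)

  def teardown(self, teardown_request):
    # The runner asks to tear down one specific bundle descriptor as soon
    # as the corresponding executable stage is disposed.
    processor = self.bundle_processors.pop(
        teardown_request.process_bundle_descriptor_id, None)
    if processor is not None:
      processor.shutdown()  # runs DoFn.teardown()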

More detail can be found in [2].

As for Approach 2, the SDK harness could be shared between multiple
executable stages. The control service termination only occurs when all the
executable stages sharing the same SDK harness have finished. This means
that the teardown of the DoFns may not be executed immediately after an
executable stage finishes.

So, I prefer Approach 1. Welcome any feedback :)

Best,
Jincheng

[1]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/sdk_worker.py
[2]
https://docs.google.com/document/d/1sCgy9VQPf9zVXKRquK8P6N4x7aB62GEO8ozkujRSHZg/edit?usp=sharing


Re: Apply Confluence permission for fix `Python Tips` wiki command issue

2019-10-11 Thread jincheng sun
Hi Thomas,

Thank you for granting access. I have updated the `Python Tips` page.

Best,
Jincheng

Thomas Weise  于2019年10月10日周四 下午9:38写道:

> Thanks for spotting this.
>
> Access granted (for user "jincheng").
>
>
> On Thu, Oct 10, 2019 at 12:01 AM jincheng sun 
> wrote:
>
>> Hi all,
>>
>> The docker config has been removed with the latest python3 docker related
>> commit [1], So the command:
>>   > ./gradlew :sdks:python:container:docker
>> in `Python Tips`[2] can not work well, we should correct it, something
>> like:
>>  > ./gradlew :sdks:python:container:py35:docker
>>
>> The flink 1.5 and 1.6 are removed by BEAM-7962 [3], So we also need
>> correct the command in `Python Tips`:
>>   > ./gradlew :runners:flink:1.5:job-server:container:docker
>> We should change 1.5 to 1.7.
>>
>> I want correct the command in `Python Tips`, so I appreciate if anyone
>> can give me the Confluence permission.
>>
>> Best,
>> Jincheng
>>
>> [1]
>> https://github.com/apache/beam/commit/47feeafb21023e2a60ae51737cc4000a2033719c#diff-1bc5883bcfcc9e883ab7df09e4dcddb0L63
>> [2] https://cwiki.apache.org/confluence/display/BEAM/Python+Tips
>> [3] https://github.com/apache/beam/pull/9632
>>
>>


Apply Confluence permission for fix `Python Tips` wiki command issue

2019-10-10 Thread jincheng sun
Hi all,

The docker config has been removed in the latest Python 3 docker related
commit [1], so the command:
  > ./gradlew :sdks:python:container:docker
in `Python Tips` [2] no longer works. We should correct it to something like:
 > ./gradlew :sdks:python:container:py35:docker

Flink 1.5 and 1.6 were removed by BEAM-7962 [3], so we also need to correct
the command in `Python Tips`:
  > ./gradlew :runners:flink:1.5:job-server:container:docker
We should change 1.5 to 1.7.

I want to correct these commands in `Python Tips`, so I would appreciate it
if anyone could give me Confluence permission.

Best,
Jincheng

[1]
https://github.com/apache/beam/commit/47feeafb21023e2a60ae51737cc4000a2033719c#diff-1bc5883bcfcc9e883ab7df09e4dcddb0L63
[2] https://cwiki.apache.org/confluence/display/BEAM/Python+Tips
[3] https://github.com/apache/beam/pull/9632


Re: Received status code 500 from server: Internal Server Error

2019-10-08 Thread jincheng sun
Thanks for your quick response. I think your approach works locally.
Thank you, @Łukasz!
I sent this info to the dev ML not just for my local env, but for all the
Beam contributors, because they will get PreCommit failures when opening a
PR. Anyway, thanks for your kind solution for working locally.
Currently, I have triggered the PreCommit in the PR, and it works now.

Best,
Jincheng

Kyle Weaver  于2019年10月8日周二 下午11:01写道:

> Is there a way we could add a backup server url to our configuration to
> use if the sonatype server is down?
>
> On Tue, Oct 8, 2019 at 4:56 AM Łukasz Gajowy 
> wrote:
>
>> of course I meant:
>>
>> maven { url "https://oss.sonatype.org/content/repositories/staging/" }
>> => maven { url "https://repo1.maven.org/maven2/" }
>>
>> :)
>>
>> wt., 8 paź 2019 o 13:53 Łukasz Gajowy 
>> napisał(a):
>>
>>> It seems that oss.sonatype was down and this prevented us from
>>> downloading required resources from it. I can't use it either (I was
>>> getting the same error). If running gradle in offline mode does not help
>>> (--offline flag) another temporary solution is to replace the url In
>>> Repository.groovy
>>> <https://github.com/apache/beam/blob/85c0e8364376f19f2e0eb5b5c7bea6639702725b/buildSrc/src/main/groovy/org/apache/beam/gradle/Repositories.groovy#L50>
>>>  to
>>> e.g. a maven central one when working locally:
>>>
>>> maven { url "https://repo1.maven.org/maven2/" } => maven { url "
>>> https://oss.sonatype.org/content/repositories/staging/" }
>>>
>>> When sonatype is up again you should be fine without this hack.
>>>
>>> I hope this helps.
>>>
>>> Łukasz
>>>
>>> wt., 8 paź 2019 o 11:55 jincheng sun 
>>> napisał(a):
>>>
>>>> Hi all,
>>>> I got the 500 error, when do the PreCommit. We can run the following
>>>> command to see the detail:
>>>>
>>>> ./gradlew :sdks:python:test-suites:portable:py2:flinkValidatesRunner
>>>>
>>>> >>
>>>> Task :model:pipeline:compileJava FAILED
>>>> FAILURE: Build failed with an exception.
>>>> * What went wrong:
>>>> Execution failed for task ':model:pipeline:compileJava'.
>>>> > Could not resolve all files for configuration
>>>> ':model:pipeline:errorprone'.
>>>>> Could not resolve
>>>> com.google.errorprone:error_prone_core:latest.release.
>>>>  Required by:
>>>>  project :model:pipeline
>>>>   > Failed to list versions for
>>>> com.google.errorprone:error_prone_core.
>>>>  > Unable to load Maven meta-data from
>>>> https://oss.sonatype.org/content/repositories/staging/com/google/errorprone/error_prone_core/maven-metadata.xml
>>>> .
>>>> > Could not HEAD '
>>>> https://oss.sonatype.org/content/repositories/staging/com/google/errorprone/error_prone_core/maven-metadata.xml'.
>>>> Received status code 500 from server: Internal Server Error
>>>>
>>>> I appreciate if anyone help solve the server problem!
>>>>
>>>> Best,
>>>> Jincheng
>>>>
>>>>
>>>>
>>>>


Received status code 500 from server: Internal Server Error

2019-10-08 Thread jincheng sun
Hi all,
I got a 500 error when running the PreCommit. We can run the following
command to see the details:

./gradlew :sdks:python:test-suites:portable:py2:flinkValidatesRunner

>>
Task :model:pipeline:compileJava FAILED
FAILURE: Build failed with an exception.
* What went wrong:
Execution failed for task ':model:pipeline:compileJava'.
> Could not resolve all files for configuration
':model:pipeline:errorprone'.
   > Could not resolve
com.google.errorprone:error_prone_core:latest.release.
 Required by:
 project :model:pipeline
  > Failed to list versions for com.google.errorprone:error_prone_core.
 > Unable to load Maven meta-data from
https://oss.sonatype.org/content/repositories/staging/com/google/errorprone/error_prone_core/maven-metadata.xml
.
> Could not HEAD '
https://oss.sonatype.org/content/repositories/staging/com/google/errorprone/error_prone_core/maven-metadata.xml'.
Received status code 500 from server: Internal Server Error

I would appreciate it if anyone could help solve the server problem!

Best,
Jincheng


Re: [ANNOUNCE] New committer: Valentyn Tymofieiev

2019-09-02 Thread jincheng sun
Congrats Valentyn !

Cheers,
Jincheng

Katarzyna Kucharczyk  于2019年9月2日周一 下午5:20写道:

> Congratulations Valentyn! 
>
> On Wed, Aug 28, 2019 at 3:49 PM Frederik Bode 
> wrote:
>
>> Congrats Valentyn!!
>>
>>
>> On Wed, 28 Aug 2019 at 15:28, Tanay Tummalapalli 
>> wrote:
>>
>>> Congratulations Valentyn!
>>>
>>> On Wed, Aug 28, 2019 at 7:16 AM Ruoyun Huang  wrote:
>>>
 Congratulations Valentyn!

 On Tue, Aug 27, 2019 at 6:16 PM Daniel Oliveira 
 wrote:

> Congratulations Valentyn!
>
> On Tue, Aug 27, 2019, 11:31 AM Boyuan Zhang 
> wrote:
>
>> Congratulations!
>>
>> On Tue, Aug 27, 2019 at 10:44 AM Udi Meiri  wrote:
>>
>>> Congrats!
>>>
>>> On Tue, Aug 27, 2019 at 9:50 AM Yichi Zhang 
>>> wrote:
>>>
 Congrats Valentyn!

 On Tue, Aug 27, 2019 at 7:55 AM Valentyn Tymofieiev <
 valen...@google.com> wrote:

> Thank you everyone!
>
> On Tue, Aug 27, 2019 at 2:57 AM Alexey Romanenko <
> aromanenko@gmail.com> wrote:
>
>> Congrats, well deserved!
>>
>> On 27 Aug 2019, at 11:25, Jan Lukavský  wrote:
>>
>> Congrats Valentyn!
>> On 8/26/19 11:43 PM, Rui Wang wrote:
>>
>> Congratulations!
>>
>>
>> -Rui
>>
>> On Mon, Aug 26, 2019 at 2:36 PM Hannah Jiang <
>> hannahji...@google.com> wrote:
>>
>>> Congratulations Valentyn, well deserved!
>>>
>>> On Mon, Aug 26, 2019 at 2:34 PM Chamikara Jayalath <
>>> chamik...@google.com> wrote:
>>>
 Congrats Valentyn!

 On Mon, Aug 26, 2019 at 2:32 PM Pablo Estrada <
 pabl...@google.com> wrote:

> Thanks Valentyn!
>
> On Mon, Aug 26, 2019 at 2:29 PM Robin Qiu 
> wrote:
>
>> Thank you Valentyn! Congratulations!
>>
>> On Mon, Aug 26, 2019 at 2:28 PM Robert Bradshaw <
>> rober...@google.com> wrote:
>>
>>> Hi,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming a
>>> new
>>> committer: Valentyn Tymofieiev
>>>
>>> Valentyn has made numerous contributions to Beam over the
>>> last several
>>> years (including 100+ pull requests), most recently pushing
>>> through
>>> the effort to make Beam compatible with Python 3. He is also
>>> an active
>>> participant in design discussions on the list, participates
>>> in release
>>> candidate validation, and proactively helps keep our tests
>>> green.
>>>
>>> In consideration of Valentyn's contributions, the Beam PMC
>>> trusts him
>>> with the responsibilities of a Beam committer [1].
>>>
>>> Thank you, Valentyn, for your contributions and looking
>>> forward to many more!
>>>
>>> Robert, on behalf of the Apache Beam PMC
>>>
>>> [1]
>>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>>
>>
>>

 --
 
 Ruoyun  Huang




Re: [PROPOSAL] Preparing for Beam 2.16.0 release

2019-08-29 Thread jincheng sun
Hi Mark,

+1 and thank you for keeping the cadence!

BTW, I have marked the Fix Version of some issues as 2.17, as they cannot
be merged into 2.16.

Best,
Jincheng

Mark Liu  于2019年8月29日周四 上午6:14写道:

> Hi all,
>
> Beam 2.16 release branch cut is scheduled on Sep 11 according to the
> release calendar [1]. I would like to volunteer myself to do this
> release. The plan is to cut the branch on that date, and cherrypick
> release-blocking fixes afterwards if any.
>
> If you have release blocking issues for 2.16 please mark their "Fix
> Version" as 2.16.0 [2]. This tag is already created in JIRA in case you
> would like to move any non-blocking issues to that version.
>
> Any thoughts, comments, objections?
>
> Regards.
> Mark Liu
>
> [1]
> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
> [2]
> https://issues.apache.org/jira/browse/BEAM-8105?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Reopened%2C%20Open%2C%20%22In%20Progress%22%2C%20%22Under%20Discussion%22%2C%20%22In%20Implementation%22%2C%20%22Triage%20Needed%22)%20AND%20fixVersion%20%3D%202.16.0
>


Re: [ANNOUNCE] Beam 2.15.0 Released!

2019-08-26 Thread jincheng sun
Cheers!! Thanks for driving the release, Yifan!
Thanks a lot to everyone who helped make this release possible!

Best,
Jincheng

Thomas Weise  于2019年8月27日周二 下午12:54写道:

> Yifan, thanks for managing this release. It went smoothly!
>
>
> On Fri, Aug 23, 2019 at 2:32 PM Kenneth Knowles  wrote:
>
>> Nice work!
>>
>> On Fri, Aug 23, 2019 at 11:26 AM Charles Chen  wrote:
>>
>>> Thank you Yifan!
>>>
>>> On Fri, Aug 23, 2019 at 11:12 AM Hannah Jiang 
>>> wrote:
>>>
 Thank you Yifan!

 On Fri, Aug 23, 2019 at 11:09 AM Yichi Zhang  wrote:

> Thank you Yifan!
>
> On Fri, Aug 23, 2019 at 11:06 AM Robin Qiu  wrote:
>
>> Thank you Yifan!
>>
>> On Fri, Aug 23, 2019 at 11:05 AM Rui Wang  wrote:
>>
>>> Thank you Yifan!
>>>
>>> -Rui
>>>
>>> On Fri, Aug 23, 2019 at 9:21 AM Pablo Estrada 
>>> wrote:
>>>
 Thanks Yifan!

 On Fri, Aug 23, 2019 at 8:54 AM Connell O'Callaghan <
 conne...@google.com> wrote:

>
> +1 thank you Yifan!!!
>
> On Fri, Aug 23, 2019 at 8:49 AM Ahmet Altay 
> wrote:
>
>> Thank you Yifan!
>>
>> On Fri, Aug 23, 2019 at 8:00 AM Yifan Zou 
>> wrote:
>>
>>> The Apache Beam team is pleased to announce the release of
>>> version 2.15.0.
>>>
>>> Apache Beam is an open source unified programming model to
>>> define and
>>> execute data processing pipelines, including ETL, batch and
>>> stream
>>> (continuous) processing. See https://beam.apache.org
>>>
>>> You can download the release here:
>>>
>>> https://beam.apache.org/get-started/downloads/
>>>
>>> This release includes bug fixes, features, and improvements
>>> detailed on
>>> the Beam blog:
>>> https://beam.apache.org/blog/2019/08/22/beam-2.15.0.html
>>>
>>> Thanks to everyone who contributed to this release, and we hope
>>> you enjoy
>>> using Beam 2.15.0.
>>>
>>> Yifan Zou
>>>
>>


Re: [ANNOUNCE] New committer: Valentyn Tymofieiev

2019-08-26 Thread jincheng sun
Congrats Valentyn!

Best,
Jincheng

Ankur Goenka  于2019年8月27日周二 上午10:37写道:

> Congratulations Valentyn!
>
> On Mon, Aug 26, 2019, 5:02 PM Yifan Zou  wrote:
>
>> Congratulations, Valentyn! Well deserved!
>>
>> On Mon, Aug 26, 2019 at 3:31 PM Aizhamal Nurmamat kyzy <
>> aizha...@google.com> wrote:
>>
>>> Congratulations! and thank you for your contributions, Valentyn!
>>>
>>> On Mon, Aug 26, 2019 at 3:26 PM Thomas Weise  wrote:
>>>
 Congrats!


 On Mon, Aug 26, 2019 at 3:22 PM Heejong Lee  wrote:

> Congratulations! :)
>
> On Mon, Aug 26, 2019 at 2:44 PM Rui Wang  wrote:
>
>> Congratulations!
>>
>>
>> -Rui
>>
>> On Mon, Aug 26, 2019 at 2:36 PM Hannah Jiang 
>> wrote:
>>
>>> Congratulations Valentyn, well deserved!
>>>
>>> On Mon, Aug 26, 2019 at 2:34 PM Chamikara Jayalath <
>>> chamik...@google.com> wrote:
>>>
 Congrats Valentyn!

 On Mon, Aug 26, 2019 at 2:32 PM Pablo Estrada 
 wrote:

> Thanks Valentyn!
>
> On Mon, Aug 26, 2019 at 2:29 PM Robin Qiu 
> wrote:
>
>> Thank you Valentyn! Congratulations!
>>
>> On Mon, Aug 26, 2019 at 2:28 PM Robert Bradshaw <
>> rober...@google.com> wrote:
>>
>>> Hi,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming a new
>>> committer: Valentyn Tymofieiev
>>>
>>> Valentyn has made numerous contributions to Beam over the last
>>> several
>>> years (including 100+ pull requests), most recently pushing
>>> through
>>> the effort to make Beam compatible with Python 3. He is also an
>>> active
>>> participant in design discussions on the list, participates in
>>> release
>>> candidate validation, and proactively helps keep our tests green.
>>>
>>> In consideration of Valentyn's contributions, the Beam PMC
>>> trusts him
>>> with the responsibilities of a Beam committer [1].
>>>
>>> Thank you, Valentyn, for your contributions and looking forward
>>> to many more!
>>>
>>> Robert, on behalf of the Apache Beam PMC
>>>
>>> [1]
>>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>>
>>


Re: [DISCUSS] Turn `WindowedValue` into `T` in the FnDataService and BeamFnDataClient interface definition

2019-08-12 Thread jincheng sun
Hi all,

Thanks for your confirmation, Robert! :)

Thanks for sharing the details about the state discussion, Luke! :)

MapState is a bit complex. I think it's better to add a detailed design doc
when we work on supporting MapState.

I will create JIRAs and follow up on the subsequent developments. If there
are big changes, I will provide detailed design documentation and bring up
the discussion on the ML.

Thanks everyone for joining this discussion.

Best,
Jincheng

Lukasz Cwik  于2019年8月7日周三 下午8:19写道:

> I wanted to add some more details about the state discussion.
>
> BEAM-7000 is about adding support for a gRPC message saying that the SDK
> is now blocked on one of its requests. This would allow for an easy
> optimization on the runner side where it gathers requests and is able to
> batch them knowing that the SDK is only blocked once it sees one of the
> blocked gRPC messages. This would make it easy for the runner to gather up
> clear + append calls and convert them to sets internally.
>
> Also, most of the reason around map state not existing has been since we
> haven't discuessed the changes to the gRPC APIs that we need. (things like,
> can you lookup/clear/append to ranges?, map or multimap?, should we really
> just get rid of bag state in favor of a multimap state?, can you enumerate
> keys?, know how many keys there are?, ...)
>
> On Wed, Aug 7, 2019 at 9:52 AM Robert Bradshaw 
> wrote:
>
>> The list looks good to me. Thanks for summarizing. Feel free to dive
>> into any of these issues yourself :).
>>
>> On Fri, Aug 2, 2019 at 6:24 PM jincheng sun 
>> wrote:
>> >
>> > Hi all,
>> >
>> >
>> > Thanks a lot for sharing your thoughts!
>> >
>> >
>> > It seems that we have already reached consensus for the following
>> items. Could you please read through them again and double-check if you all
>> agree with these? If yes, then I would start creating JIRA issues for those
>> that don’t yet have a JIRA issue
>> >
>> >
>> > 1. Items that require improvements of Beam:
>> >
>> >
>> > 1) The configuration of "semi_persist_dir" should be configurable.
>> (TODO)
>> >
>> >
>> https://github.com/apache/beam/blob/master/sdks/python/container/boot.go#L48
>> >
>> >
>> > 2) Time-based cache threshold should be supported. (TODO)
>> >
>> >
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/data_plane.py#L259
>> >
>> >
>> https://github.com/apache/beam/blob/master/sdks/java/fn-execution/src/main/java/org/apache/beam/sdk/fn/data/BeamFnDataBufferingOutboundObserver.java
>> >
>> >
>> > 3) Cross-bundle cache should be supported. (
>> https://issues.apache.org/jira/browse/BEAM-5428)
>> >
>> >
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/bundle_processor.py#L349
>> >
>> >
>> > 4) Allows to configure the log level. (TODO)
>> >
>> > https://issues.apache.org/jira/browse/BEAM-5468
>> >
>> >
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/sdk_worker_main.py#L102
>> >
>> >
>> > 5) Improves the interfaces of classes such as FnDataService,
>> BundleProcessor, ActiveBundle, etc to change the parameter type from
>> WindowedValue to T. (TODO)
>> >
>> >
>> > 6) Python 3 is already supported in Beam. The warning should be
>> removed. (TODO)
>> >
>> > https://github.com/apache/beam/blob/master/sdks/python/setup.py#L179
>> >
>> >
>> > 7) The coder of WindowedValue should be configurable which makes it
>> possible to use customization coder such as ValueOnlyWindowedValueCoder.
>> (TODO)
>> >
>> >
>> https://github.com/apache/beam/blob/master/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/wire/WireCoders.java#L91
>> >
>> >
>> > 8) The schema work can be used to solve the performance issue of the
>> extra prefixing length of encoding. However, it should also be supported in
>> Python. (https://github.com/apache/beam/pull/9188)
>> >
>> >
>> https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/schema.proto
>> >
>> >
>> > 9) MapState should be supported in the gRPC protocol. (TODO)
>> >
>> >
>> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/proto/beam_fn_api.proto#L662
>> >
>> >
>> >
>> >
>> >

Re: [DISCUSS] Turn `WindowedValue` into `T` in the FnDataService and BeamFnDataClient interface definition

2019-08-02 Thread jincheng sun
Hi all,

Thanks a lot for sharing your thoughts!

It seems that we have already reached consensus on the following items.
Could you please read through them again and double-check that you all agree
with these? If yes, then I will start creating JIRA issues for those that
don’t yet have a JIRA issue.

1. Items that require improvements of Beam:

1) The configuration of "semi_persist_dir" should be configurable. (TODO)

https://github.com/apache/beam/blob/master/sdks/python/container/boot.go#L48

2) Time-based cache threshold should be supported. (TODO; a small sketch of
the idea follows this list)

https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/data_plane.py#L259

https://github.com/apache/beam/blob/master/sdks/java/fn-execution/src/main/java/org/apache/beam/sdk/fn/data/BeamFnDataBufferingOutboundObserver.java

3) Cross-bundle cache should be supported. (
https://issues.apache.org/jira/browse/BEAM-5428)

https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/bundle_processor.py#L349

4) Allow configuring the log level. (TODO)

https://issues.apache.org/jira/browse/BEAM-5468

https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/sdk_worker_main.py#L102

5) Improves the interfaces of classes such as FnDataService,
BundleProcessor, ActiveBundle, etc to change the parameter type from
WindowedValue to T. (TODO)

6) Python 3 is already supported in Beam. The warning should be removed.
(TODO)

https://github.com/apache/beam/blob/master/sdks/python/setup.py#L179

7) The coder of WindowedValue should be configurable, which makes it
possible to use a customized coder such as ValueOnlyWindowedValueCoder.
(TODO)

https://github.com/apache/beam/blob/master/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/wire/WireCoders.java#L91

8) The schema work can be used to solve the performance issue of the extra
prefixing length of encoding. However, it should also be supported in
Python. (https://github.com/apache/beam/pull/9188)

https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/schema.proto

9) MapState should be supported in the gRPC protocol. (TODO)

https://github.com/apache/beam/blob/master/model/fn-execution/src/main/proto/beam_fn_api.proto#L662
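
Regarding item 2) above, here is a minimal sketch of the idea (a
hypothetical buffer class for illustration; the real change would go into
the Python data_plane.py and the Java BeamFnDataBufferingOutboundObserver):

import time

class TimeAndSizeBufferSketch(object):
  """Sketch: flush the outbound data buffer on size OR elapsed time."""

  def __init__(self, flush_fn, size_limit_bytes=1 << 20, time_limit_secs=0.5):
    self._flush_fn = flush_fn
    self._size_limit = size_limit_bytes
    self._time_limit = time_limit_secs
    self._buffer = []
    self._byte_size = 0
    self._last_flush = time.time()

  def append(self, encoded_element):
    self._buffer.append(encoded_element)
    self._byte_size += len(encoded_element)
    # Today only the size threshold exists; the time threshold below is the
    # proposed addition so that low-throughput streams still see data promptly.
    if (self._byte_size >= self._size_limit
        or time.time() - self._last_flush >= self._time_limit):
      self.flush()

  def flush(self):
    if self._buffer:
      self._flush_fn(b''.join(self._buffer))
    self._buffer = []
    self._byte_size = 0
    self._last_flush = time.time()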



2. Items where we don’t need to do anything for now:

1) The default buffer size is enough for most cases and there is no need to
make it configurable for now.

2) Do not support ValueState in the gRPC protocol for now unless we have
evidence it matters.


If there is any incorrect understanding, please feel free to correct me :)



There are also some items that I didn’t bring up earlier which require
further discussion:

1) The input queue size of the input buffer in Python SDK Harness is not
size limited. We should give a reasonable default size.

https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/data_plane.py#L175

2) Refactor the code to avoid pulling in unnecessary dependencies. For
example, beam-sdks-java-core (11MB) is a package for Java SDK users, and it
is pulled in because a few classes from beam-sdks-java-core are used in
beam-runners-java-fn-execution, such as:

PipelineOptions used in DefaultJobBundleFactory, and FileSystems used in
BeamFileSystemArtifactRetrievalService.

This means we could maybe add a new module such as beam-sdks-java-common to
hold the classes used by both the runner and the SDK.

3) Allow starting up the StatusServer according to configuration in the
Python SDK Harness. Currently the StatusServer is started up by default.

https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/sdk_worker_main.py#L113


Best,

Jincheng

jincheng sun  于2019年8月2日周五 下午4:14写道:

> Thanks for sharing the details of the current StandardCoders, Max!
> That's true, Flink may need to define some of the coders, and I will share
> the POC in the Flink Python UDFs DISCUSS thread later :)
>
> Best,
> Jincheng
>
> Maximilian Michels  于2019年7月31日周三 下午2:53写道:
>
>> Hi Jincheng,
>>
>> Thanks for getting back to us.
>>
>> > For the next major release of Flink, we plan to add Python user defined
>> functions(UDF, UDTF, UDAF) support in Flink and I have go over the Beam
>> portability framework and think that it is perfect for our requirements.
>> However we also find some improvements needed for Beam:
>>
>> That sounds great! The improvement list contains very reasonable
>> suggestions, some of them which are already on our TODO list. I think
>> Thomas and Robert already provided the answers you were looking for.
>>
>> > Open questions:
>> > -
>> > 1) Which coders should be / can be defined in StandardCoders?
>>
>> The ones which are present now those are:
>>
>>   BYTES_CODER
>>   INT64_CODER
>>   STRING

Re: [DISCUSS] Turn `WindowedValue` into `T` in the FnDataService and BeamFnDataClient interface definition

2019-08-02 Thread jincheng sun
> coders are defined in StandardCoders. It means that for most coder, a
> length will be added to the serialized bytes which is not necessary in my
> thoughts. My suggestion is maybe we can add some interfaces or tags for the
> coder which indicate whether the coder is needed a length prefix or not.
> >
> > 6) Set log level according to PipelineOption in Python SDK Harness.
> Currently the log level is set to INFO by default.
> >
> > 7) Allows to start up StatusServer according to PipelineOption in Python
> SDK Harness. Currently the StatusServer is start up by default.
> >
> > Although I put 3) 4) 5) into the "Nice to Have" as they are performance
> related, I still think they are very critical for Python UDF execution
> performance.
> >
> > Open questions:
> > -
> > 1) Which coders should be / can be defined in StandardCoders?
> >
> > Currently we are preparing the design of how to support Python UDF in
> Flink based on the Beam portability framework and we will bring up the
> discussion in Flink community. We may propose more changes for Beam during
> that time and may need more support from Beam community.
> >
> > To be honest, I'm not an expert of Beam and so please feel free to
> correct me if my understanding is wrong. Welcome any feedback.
> >
> > Best,
> > Jincheng
>
>
>
>
>
>
> Cheers,
> Max
>
> On 31.07.19 12:16, Robert Bradshaw wrote:
> > Yep, Python support under active development,
> > e.g. https://github.com/apache/beam/pull/9188
> >
> > On Wed, Jul 31, 2019 at 9:24 AM jincheng sun  > <mailto:sunjincheng...@gmail.com>> wrote:
> >
> > Thanks a lot for sharing the link. I take a quick look at the design
> > and the implementation in Java and think it could address my
> > concern. It seems that it's still not supported in the Python SDK
> > Harness. Is there any plan on that?
> >
> > Robert Bradshaw mailto:rober...@google.com>>
> > 于2019年7月30日周二 下午12:33写道:
> >
> > On Tue, Jul 30, 2019 at 11:52 AM jincheng sun
> > mailto:sunjincheng...@gmail.com>>
> wrote:
> >
> >
> > Is it possible to add an interface such as
> > `isSelfContained()` to the `Coder`? This interface
> > indicates
> > whether the serialized bytes are self contained. If
> > it returns true, then there is no need to add a
> > prefixing length.
> > In this way, there is no need to introduce an extra
> > protocol,  Please correct me if I missed something :)
> >
> >
> > The question is how it is self contained. E.g.
> > DoubleCoder is self contained because it always uses
> > exactly 8 bytes, but one needs to know the double coder
> > to leverage this. VarInt coder is self-contained a
> > different way, as is StringCoder (which does just do
> > prefixing).
> >
> >
> > Yes, you are right! I think it again that we can not add
> > such interface for the coder, due to runner can not call it.
> > And just one more thought: does it make sense to add a
> > method such as "registerSelfContained Coder(xxx)" or so to
> > let users register the coders which can be processed in the
> > SDK Harness?  It's the responsibility of the SDK harness to
> > ensure that the coder is supported.
> >
> >
> > Basically, a "please don't add length prefixing to this coder,
> > assume everyone else can understand it (and errors will ensue if
> > anyone doesn't)" at the user level? Seems a bit dangerous. Also,
> > there is not "the SDK"--there may be multiple other SDKs in
> > general, and of course runner components, some of which may
> > understand the coder in question and some of which may not.
> >
> > I would say that if this becomes a problem, we could look at the
> > pros and cons of various remedies, this being one alternative.
> >
> >
> >
> >
> > I am hopeful that schemas give us a rich enough way to
> > encode the vast majority of types that we will want to
> > transmit across language barriers (possibly with some
> > widening promotions). For high performance one will want
> > to use formats like arrow rather than one-off coders as
> > well, which also biases us towards the schema work. The
> > set of StandardCoders is not closed, and nor is the
> > possibility of figuring out a way to communicate outside
> > this set for a particular pair of languages, but I think
> > it makes sense to avoid going that direction unless we
> > have to due to the increased API surface aread and
> > complexity it imposes on all runners and SDKs.
> >
> >
> > Great! Could you share some links about the schema work. It
> > seems very interesting and promising.
> >
> >
> > https://beam.apache.org/contribute/design-documents/#sql--schema
>  and
> > of particular relevance https://s.apache.org/beam-schemas
> >
> >
> >
>


Re: [ANNOUNCE] New committer: Jan Lukavský

2019-08-01 Thread jincheng sun
Congratulations Jan!
Best, Jincheng

Gleb Kanterov  于2019年8月1日周四 下午5:09写道:

> Congratulations!
>
> On Thu, Aug 1, 2019 at 3:11 PM Reza Rokni  wrote:
>
>> Congratulations , awesome stuff !
>>
>> On Thu, 1 Aug 2019, 12:11 Maximilian Michels,  wrote:
>>
>>> Congrats, Jan! Good to see you become a committer :)
>>>
>>> On 01.08.19 12:37, Łukasz Gajowy wrote:
>>> > Congratulations!
>>> >
>>> > czw., 1 sie 2019 o 11:16 Robert Bradshaw >> > > napisał(a):
>>> >
>>> > Congratulations!
>>> >
>>> > On Thu, Aug 1, 2019 at 9:59 AM Jan Lukavský >> > > wrote:
>>> >
>>> > Thanks everyone!
>>> >
>>> > Looking forward to working with this great community! :-)
>>> >
>>> > Cheers,
>>> >
>>> >  Jan
>>> >
>>> > On 8/1/19 12:18 AM, Rui Wang wrote:
>>> > > Congratulations!
>>> > >
>>> > >
>>> > > -Rui
>>> > >
>>> > > On Wed, Jul 31, 2019 at 10:51 AM Robin Qiu <
>>> robi...@google.com
>>> > > > wrote:
>>> > >
>>> > > Congrats!
>>> > >
>>> > > On Wed, Jul 31, 2019 at 10:31 AM Aizhamal Nurmamat kyzy
>>> > > mailto:aizha...@apache.org>>
>>> wrote:
>>> > >
>>> > > Congratulations, Jan! Thank you for your
>>> contributions!
>>> > >
>>> > > On Wed, Jul 31, 2019 at 10:04 AM Tanay Tummalapalli
>>> > > mailto:ttanay...@gmail.com>>
>>> wrote:
>>> > >
>>> > > Congratulations!
>>> > >
>>> > > On Wed, Jul 31, 2019 at 10:05 PM Ahmet Altay
>>> > > mailto:al...@google.com>>
>>> wrote:
>>> > >
>>> > > Congratulations Jan! Thank you for your
>>> > > contributions!
>>> > >
>>> > > On Wed, Jul 31, 2019 at 2:30 AM Ankur Goenka
>>> > > mailto:goe...@google.com
>>> >>
>>> > > wrote:
>>> > >
>>> > > Congratulations Jan!
>>> > >
>>> > > On Wed, Jul 31, 2019, 1:23 AM David
>>> > > Morávek >> > > > wrote:
>>> > >
>>> > > Congratulations Jan, well deserved!
>>> ;)
>>> > >
>>> > > D.
>>> > >
>>> > > On Wed, Jul 31, 2019 at 10:17 AM Ryan
>>> > > Skraba >> > > > wrote:
>>> > >
>>> > > Congratulations Jan!
>>> > >
>>> > > On Wed, Jul 31, 2019 at 10:10 AM
>>> > > Ismaël Mejía >> > > >
>>> wrote:
>>> > > >
>>> > > > Hi,
>>> > > >
>>> > > > Please join me and the rest of
>>> > > the Beam PMC in welcoming a new
>>> > > > committer: Jan Lukavský.
>>> > > >
>>> > > > Jan has been contributing to
>>> > > Beam for a while, he was part of
>>> > > the team
>>> > > > that contributed the Euphoria
>>> > > DSL extension, and he has done
>>> > > > interesting improvements for
>>> the
>>> > > Spark and Direct runner. He has
>>> also
>>> > > > been active in the community
>>> > > discussions around the Beam
>>> model and
>>> > > > other subjects.
>>> > > >
>>> > > > In consideration of Jan's
>>> > > contributions, the Beam PMC
>>> trusts
>>> > > him with
>>> > > > the responsibilities of a Beam
>>> > > committer [1].
>>> > > >
>>> > > > Thank you, Jan, for your
>>> > > contributions and looking forward
>>> > > to many more!
>>> > > >
>>> > > > Ismaël, on behalf of the Apache
>>> > > Beam PMC
>>> > > >
>>> > > > [1]
>>> > >
>>> 

Re: [DISCUSS] Turn `WindowedValue` into `T` in the FnDataService and BeamFnDataClient interface definition

2019-07-31 Thread jincheng sun
Thanks a lot for sharing the link. I took a quick look at the design and
the implementation in Java and think it could address my concern. It seems
that it's still not supported in the Python SDK harness. Is there any plan
for that?

Robert Bradshaw  于2019年7月30日周二 下午12:33写道:

> On Tue, Jul 30, 2019 at 11:52 AM jincheng sun 
> wrote:
>
>>
>>>> Is it possible to add an interface such as `isSelfContained()` to the
>>>> `Coder`? This interface indicates
>>>> whether the serialized bytes are self contained. If it returns true,
>>>> then there is no need to add a prefixing length.
>>>> In this way, there is no need to introduce an extra protocol,  Please
>>>> correct me if I missed something :)
>>>>
>>>
>>> The question is how it is self contained. E.g. DoubleCoder is self
>>> contained because it always uses exactly 8 bytes, but one needs to know the
>>> double coder to leverage this. VarInt coder is self-contained a different
>>> way, as is StringCoder (which does just do prefixing).
>>>
>>
>> Yes, you are right! I think it again that we can not add such interface
>> for the coder, due to runner can not call it. And just one more thought:
>> does it make sense to add a method such as "registerSelfContained
>> Coder(xxx)" or so to let users register the coders which can be processed
>> in the SDK Harness?  It's the responsibility of the SDK harness to ensure
>> that the coder is supported.
>>
>
> Basically, a "please don't add length prefixing to this coder, assume
> everyone else can understand it (and errors will ensue if anyone doesn't)"
> at the user level? Seems a bit dangerous. Also, there is not "the
> SDK"--there may be multiple other SDKs in general, and of course runner
> components, some of which may understand the coder in question and some of
> which may not.
>
> I would say that if this becomes a problem, we could look at the pros and
> cons of various remedies, this being one alternative.
>
>
>>
>>
>>> I am hopeful that schemas give us a rich enough way to encode the vast
>>> majority of types that we will want to transmit across language barriers
>>> (possibly with some widening promotions). For high performance one will
>>> want to use formats like arrow rather than one-off coders as well, which
>>> also biases us towards the schema work. The set of StandardCoders is not
>>> closed, and nor is the possibility of figuring out a way to communicate
>>> outside this set for a particular pair of languages, but I think it makes
>>> sense to avoid going that direction unless we have to due to the increased
>>> API surface aread and complexity it imposes on all runners and SDKs.
>>>
>>
>> Great! Could you share some links about the schema work. It seems very
>> interesting and promising.
>>
>
> https://beam.apache.org/contribute/design-documents/#sql--schema and of
> particular relevance https://s.apache.org/beam-schemas
>
>
>


Re: Write-through-cache in State logic

2019-07-29 Thread jincheng sun
Hi Rakesh,

Glad to see you point this problem out!
+1 for adding this implementation. Managing state with a write-through cache
is pretty important for streaming jobs!
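
As a rough illustration of the idea, here is a minimal sketch of a
write-through cache in front of the Fn API state backend (hypothetical
classes for illustration, not the actual Beam implementation):

class WriteThroughStateCacheSketch(object):
  """Sketch: serve reads from a local cache, push writes straight through."""

  def __init__(self, remote_state_handler):
    self._remote = remote_state_handler  # wraps the Fn API state channel
    self._cache = {}                     # state key -> list of values

  def read(self, state_key):
    if state_key not in self._cache:
      self._cache[state_key] = list(self._remote.read(state_key))
    return self._cache[state_key]

  def append(self, state_key, value):
    # Write through: the runner stays authoritative, the cache stays warm,
    # and later reads within (or across) bundles need no extra RPC.
    self._remote.append(state_key, value)
    self._cache.setdefault(state_key, []).append(value)

  def clear(self, state_key):
    self._remote.clear(state_key)
    self._cache[state_key] = []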

Best, Jincheng

Thomas Weise  于2019年7月29日周一 下午8:54写道:

> FYI a basic test appears to confirm the importance of the cross-bundle
> caching: I found that the throughput can be increased by playing with the
> bundle size in the Flink runner. Default caps at 1000 elements (or 1
> second). So on a high throughput stream the bundles would be capped by the
> count limit. Bumping the count limit increases the throughput by reducing
> the chatter over the state plane (more cache hits due to larger bundle).
>
> The next level of investigation would involve profiling. But just by
> looking at metrics, the CPU utilization on the Python worker side dropped
> significantly while on the Flink side it remains nearly same. There are no
> metrics for state operations on either side, I think it would be very
> helpful to get these in place also.
>
> Below the stateful processing code for reference.
>
> Thomas
>
>
> class StatefulFn(beam.DoFn):
>   count_state_spec = userstate.CombiningValueStateSpec(
>       'count', beam.coders.IterableCoder(beam.coders.VarIntCoder()), sum)
>   timer_spec = userstate.TimerSpec('timer',
>       userstate.TimeDomain.WATERMARK)
>
>   def process(self, kv, count=beam.DoFn.StateParam(count_state_spec),
>       timer=beam.DoFn.TimerParam(timer_spec), window=beam.DoFn.WindowParam):
>     count.add(1)
>     timer_seconds = (window.end.micros // 100) - 1
>     timer.set(timer_seconds)
>
>   @userstate.on_timer(timer_spec)
>   def process_timer(self, count=beam.DoFn.StateParam(count_state_spec),
>       window=beam.DoFn.WindowParam):
>     if count.read() == 0:
>       logging.warning("###timer fired with count %d, window %s" %
>           (count.read(), window))
>
>
>
> On Thu, Jul 25, 2019 at 5:09 AM Robert Bradshaw 
> wrote:
>
>> On Wed, Jul 24, 2019 at 6:21 AM Rakesh Kumar 
>> wrote:
>> >
>> > Thanks Robert,
>> >
>> >  I stumble on the jira that you have created some time ago
>> > https://jira.apache.org/jira/browse/BEAM-5428
>> >
>> > You also marked code where code changes are required:
>> >
>> https://github.com/apache/beam/blob/7688bcfc8ebb4bedf26c5c3b3fe0e13c0ec2aa6d/sdks/python/apache_beam/runners/worker/bundle_processor.py#L291
>> >
>> https://github.com/apache/beam/blob/7688bcfc8ebb4bedf26c5c3b3fe0e13c0ec2aa6d/sdks/python/apache_beam/runners/worker/bundle_processor.py#L349
>> >
>> https://github.com/apache/beam/blob/7688bcfc8ebb4bedf26c5c3b3fe0e13c0ec2aa6d/sdks/python/apache_beam/runners/worker/bundle_processor.py#L465
>> >
>> > I am willing to provide help to implement this. Let me know how I can
>> help.
>>
>> As far as I'm aware, no one is actively working on it right now.
>> Please feel free to assign yourself the JIRA entry and I'll be happy
>> to answer any questions you might have if (well probably when) these
>> pointers are insufficient.
>>
>> > On Tue, Jul 23, 2019 at 3:47 AM Robert Bradshaw 
>> wrote:
>> >>
>> >> This is documented at
>> >>
>> https://docs.google.com/document/d/1BOozW0bzBuz4oHJEuZNDOHdzaV5Y56ix58Ozrqm2jFg/edit#heading=h.7ghoih5aig5m
>> >> . Note that it requires participation of both the runner and the SDK
>> >> (though there are no correctness issues if one or the other side does
>> >> not understand the protocol, caching just won't be used).
>> >>
>> >> I don't think it's been implemented anywhere, but could be very
>> >> beneficial for performance.
>> >>
>> >> On Wed, Jul 17, 2019 at 6:00 PM Rakesh Kumar 
>> wrote:
>> >> >
>> >> > I checked the python sdk[1] and it has similar implementation as
>> Java SDK.
>> >> >
>> >> > I would agree with Thomas. In case of high volume event stream and
>> bigger cluster size, network call can potentially cause a bottleneck.
>> >> >
>> >> > @Robert
>> >> > I am interested to see the proposal. Can you provide me the link of
>> the proposal?
>> >> >
>> >> > [1]:
>> https://github.com/apache/beam/blob/db59a3df665e094f0af17fe4d9df05fe420f3c16/sdks/python/apache_beam/transforms/userstate.py#L295
>> >> >
>> >> >
>> >> > On Tue, Jul 16, 2019 at 9:43 AM Thomas Weise  wrote:
>> >> >>
>> >> >> Thanks for the pointer. For streaming, it will be important to
>> support caching across bundles. It appears that even the Java SDK doesn't
>> support that yet?
>> >> >>
>> >> >>
>> https://github.com/apache/beam/blob/77b295b1c2b0a206099b8f50c4d3180c248e252c/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/FnApiDoFnRunner.java#L221
>> >> >>
>> >> >> Regarding clear/append: It would be nice if both could occur within
>> a single Fn Api roundtrip when the state is persisted.
>> >> >>
>> >> >> Thanks,
>> >> >> Thomas
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Tue, Jul 16, 2019 at 6:58 AM Lukasz Cwik 
>> wrote:
>> >> >>>
>> >> >>> User state is built on top of read, append and clear and not off a
>> read and write paradigm to allow for blind appends.
>> >> >>>
>> >> >>> 

Re: [DISCUSS] Turn `WindowedValue` into `T` in the FnDataService and BeamFnDataClient interface definition

2019-07-29 Thread jincheng sun
Hi Robert,

Thanks for your detailed comments. I have added a few replies inline.

Best,
Jincheng

Robert Bradshaw  于2019年7月29日周一 下午12:35写道:

> On Sun, Jul 28, 2019 at 6:51 AM jincheng sun 
> wrote:
> >
> > Hi, Thomas & Robert, Thanks for your comments and providing relevant
> discussions and JIRA links, very helpful to me!
> >
> > I am glad to see your affirmative response,  And I am glad to add my
> thoughts on the comment you left:
> > -
> >
> > >> There are two distinct levels at which one can talk about a certain
> type of state being supported: the user-visible SDK's API and the runner's
> API. For example, BagState, ValueState, ReducingState, CombiningState,  can
> all be implemented on top of a runner-offered MapState in the SDK. On the
> one hand, there's a desire to keep the number of "primitive" states types
> to a minimum (to ease the use of authoring runners), but if a runner can
> perform a specific optimization due to knowing about the particular state
> type it might be preferable to pass it through rather than emulate it in
> the SDK.
> > ---
> > Agree. Regarding MapState, it's definitely needed as it cannot be
> implemented on top of the existing BagState.
> > Regarding ValueState, it can be implemented on top of BagState. However,
> we can do optimization if we know a state is ValueState.
> > For example, if a key is updated with a new value, if the ValueState is
> implemented on top of BagState, two RPC calls are needed
> > to write the new value back to runner: clear + append; if we know it's
> ValueState, just one RPC call is enough: set.
> > We can discuss case by case whether a state type is needed.
>
> In the Beam APIs [1] multiple state requests are consumed as a stream
> in a single RPC, so clear followed by append still has low overhead.
> Is that optimization not sufficient?
>
> [1]
> https://github.com/apache/beam/blob/release-2.14.0/model/fn-execution/src/main/proto/beam_fn_api.proto#L573
>
>
Actually there are two kinds of overhead:
1) the RPC overhead (I agree that streaming the state requests may already be
sufficient on this point)
2) the state read/write overhead, i.e., without the optimization the runner
needs to clear the state first and then set the new value for the state.
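
To make the second kind of overhead concrete, here is a minimal sketch (the
BagUserState / ValueUserState interfaces below are simplified stand-ins, not
Beam's actual API): emulating ValueState on top of BagState costs two state
mutations per update, while a primitive ValueState needs only one.

// Simplified stand-ins, not Beam's actual user state API.
interface BagUserState<T> {
  Iterable<T> get();
  void append(T value);
  void clear();
}

interface ValueUserState<T> {
  T read();
  void set(T value); // a single blind write
}

// Emulating ValueState on top of BagState: every update costs two mutations.
class ValueStateOnBag<T> {
  private final BagUserState<T> bag;
  ValueStateOnBag(BagUserState<T> bag) { this.bag = bag; }

  void write(T value) {
    bag.clear();       // mutation 1
    bag.append(value); // mutation 2
  }
}

// A primitive ValueState: one mutation per update.
class NativeValueState<T> {
  private final ValueUserState<T> state;
  NativeValueState(ValueUserState<T> state) { this.state = state; }

  void write(T value) {
    state.set(value);
  }
}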


> > ---
> >
> > >> Note that in the protos, the GRPC ports have a coder attribute
> > specifically to allow this kind of customization (and the SDKs should
> > be respecting that). We've also talked about going beyond per-element
> > encodings (e.g. using arrow to serialize entire batches across the
> > whire). I think all the runner code simply uses the default and we
> > could be more intelligent there.
> > ---
> >
> > Yes, the gRPC allows to use customization coder. However, I'm afraid
> that this is not enough as we want to use
> > Beam's portability framework by depending on the modules used
> (beam-runners-java-fn-execution and the Python SDK Harness) instead
> > of copying that part of code to Flink. So it should also allow to use
> the customization coder in beam-runners-java-fn-execution.
> > Otherwise, we have to copy a lot of code to Flink to use the
> customization coder.
>
> Agreed, beam-runners-java-fn-execution does not take advantage of the
> full flexibility of the protocol, and would make a lot of sense to
> enhance it to be able to.
>
> > ---
> >
> > >> I'm wary of having too many buffer size configuration options (is
> > there a compelling reason to make it bigger or smaller?) but something
> > timebased would be very useful.
> > ---
> >
> > I think the default values of buffer size are not needed to change for
> most cases. I'm not sure for just one case: _DEFAULT_FLUSH_THRESHOLD=10MB.
> > Will 1MB makes more sense?
>
> IIRC, 10MB was the point at which, according to benchmarks Luke did
> quite a while ago, there was clearly no performance benefit in making
> it larger. Coupled with a time-based threshold, I don't see much of an
> advantage to lowering it.


My concern is that the SDK harness may be shared by a lot of runners and
there will be at least one write buffer for each runner. Is it possible that
so many write buffers are in use that they take up a lot of memory, so that
users would want to lower the threshold? Nevertheless, I think this problem
is not critical considering that we all agree a time-based threshold should
be supported. :)
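
As a rough illustration of the kind of time-plus-size flush policy I have in
mind (this is only a sketch, not Beam's actual
BeamFnDataBufferingOutboundObserver), the buffer is flushed when either the
size threshold or the time threshold is reached, so elements in a quiet
streaming pipeline do not sit in the buffer indefinitely:

import java.util.concurrent.TimeUnit;

// Not Beam's real class: a stand-in showing a combined size- and time-based flush.
class BufferingOutboundSender {
  private final int flushThresholdBytes;
  private final long flushIntervalNanos;
  private final StringBuilder buffer = new StringBuilder(); // stand-in for encoded elements
  private long lastFlushNanos = System.nanoTime();

  BufferingOutboundSender(int flushThresholdBytes, long flushIntervalMillis) {
    this.flushThresholdBytes = flushThresholdBytes;
    this.flushIntervalNanos = TimeUnit.MILLISECONDS.toNanos(flushIntervalMillis);
  }

  void send(String encodedElement) {
    buffer.append(encodedElement);
    boolean sizeExceeded = buffer.length() >= flushThresholdBytes;
    boolean timeExceeded = System.nanoTime() - lastFlushNanos >= flushIntervalNanos;
    if (sizeExceeded || timeExceeded) {
      flush();
    }
  }

  void flush() {
    // A real implementation would emit the buffered bytes over gRPC here.
    System.out.println("flushing " + buffer.length() + " bytes");
    buffer.setLength(0);
    lastFlushNanos = System.nanoTime();
  }

  public static void main(String[] args) {
    BufferingOutboundSender sender = new BufferingOutboundSender(1 << 20, 100);
    for (int i = 0; i < 5; i++) {
      sender.send("element-" + i);
    }
    sender.flush(); // final flush at the end of the bundle
  }
}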


> > ---
> >
> > >> The idea of StandardCoders is it's a set of coders that all runners
> > and SDKs can be assumed to understand. If you have an element encoded
> > with

Re: [DISCUSS] Turn `WindowedValue` into `T` in the FnDataService and BeamFnDataClient interface definition

2019-07-27 Thread jincheng sun
Hi, Thomas & Robert, Thanks for your comments and providing relevant
discussions and JIRA links, very helpful to me!

I am glad to see your positive responses, and I would like to add my
thoughts on the comments you left:
-

>> There are two distinct levels at which one can talk about a certain type
of state being supported: the user-visible SDK's API and the runner's API.
For example, BagState, ValueState, ReducingState, CombiningState,  can all
be implemented on top of a runner-offered MapState in the SDK. On the one
hand, there's a desire to keep the number of "primitive" states types to a
minimum (to ease the use of authoring runners), but if a runner can perform
a specific optimization due to knowing about the particular state type it
might be preferable to pass it through rather than emulate it in the SDK.
---
Agree. Regarding MapState, it's definitely needed as it cannot be
implemented on top of the existing BagState.
Regarding ValueState, it can be implemented on top of BagState. However, we
can optimize if we know a state is a ValueState.
For example, when a key is updated with a new value, if the ValueState is
implemented on top of BagState, two RPC calls are needed
to write the new value back to the runner: clear + append; if we know it's a
ValueState, just one RPC call is enough: set.
We can discuss case by case whether a state type is needed.

---

>> Note that in the protos, the GRPC ports have a coder attribute
specifically to allow this kind of customization (and the SDKs should
be respecting that). We've also talked about going beyond per-element
encodings (e.g. using arrow to serialize entire batches across the
whire). I think all the runner code simply uses the default and we
could be more intelligent there.
---

Yes, gRPC allows using a custom coder. However, I'm afraid that this is not
enough, as we want to use Beam's portability framework by depending on the
existing modules (beam-runners-java-fn-execution and the Python SDK Harness)
instead of copying that part of the code into Flink. So it should also be
possible to use a custom coder in beam-runners-java-fn-execution.
Otherwise, we would have to copy a lot of code into Flink to use a custom
coder.

---

>> I'm wary of having too many buffer size configuration options (is
there a compelling reason to make it bigger or smaller?) but something
timebased would be very useful.
---

I think the default buffer sizes do not need to change for most cases. I'm
only unsure about one case: _DEFAULT_FLUSH_THRESHOLD = 10 MB.
Would 1 MB make more sense?


---

>> The idea of StandardCoders is it's a set of coders that all runners
and SDKs can be assumed to understand. If you have an element encoded
with something other Coder, then there's no way to know if the other
side will be able to decode it (or, indeed, even properly detect
element boundaries in the stream of contiguous encoded elements).
Adding a length prefixed coder wrapping allows the other side to at
least pull it out and pass it around as encoded bytes. In other words,
whether an encoded element needs length prefixing is a function of the
other process you're trying to communicate with (and we don't have the
mechanisms, and I'm not sure it's worth the complexity, to do some
kind of coder negotiation between the processes here.) Of course for a
UDF, if the other side does not know about the element type in
question it'd be difficult (in general) to meaningfully process the
element.

The work on schemas, and making those portable, will result in a much
richer set of element types that can be passed through "standard"
coders.
---

The design makes sense to me. My concern is that if a coder is not among the
StandardCoders, it will be prefixed with a length even if the harness knows
how to decode it. Besides, I'm also curious about the criteria for whether a
coder can be put into StandardCoders.
For example, I noticed that FLOAT is not among the StandardCoders, while
DOUBLE is.
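
To illustrate the length-prefix concern, a minimal sketch (assuming the Beam
Java SDK classes LengthPrefixCoder, DoubleCoder and CoderUtils are on the
classpath; the byte counts are only illustrative):

import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.DoubleCoder;
import org.apache.beam.sdk.coders.LengthPrefixCoder;
import org.apache.beam.sdk.util.CoderUtils;

public class LengthPrefixOverhead {
  public static void main(String[] args) throws Exception {
    Coder<Double> plain = DoubleCoder.of();
    Coder<Double> prefixed = LengthPrefixCoder.of(DoubleCoder.of());

    // Every element wrapped by LengthPrefixCoder carries an extra var-int length,
    // even when both sides could already decode the inner encoding.
    System.out.println("plain:    " + CoderUtils.encodeToByteArray(plain, 1.0d).length + " bytes");    // typically 8
    System.out.println("prefixed: " + CoderUtils.encodeToByteArray(prefixed, 1.0d).length + " bytes"); // typically 9
  }
}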

Best, Jincheng

Robert Bradshaw  于2019年7月25日周四 下午2:00写道:

> On Thu, Jul 25, 2019 at 5:31 AM Thomas Weise  wrote:
> >
> > Hi Jincheng,
> >
> > It is very exciting to see this follow-up, that you have done your
> research on the current state and that there is the intention to join
> forces on the portability effort!
> >
> > I have added a few pointers inline.
> >
> > Several of the issues you identified affect our usage of Beam as well.
> These present an opportunity for collaboration.
>
> +1, a lot of this aligns with improvements we'd like to make as well.
>
> > On Wed, Jul 24, 2019 at 2:53 AM jincheng sun 
> wrote:
> >>
> >> Hi all,
> >>
> >> Thanks Max and all of your kind words. :)
> >>
> >> S

Re: [DISCUSS] Turn `WindowedValue` into `T` in the FnDataService and BeamFnDataClient interface definition

2019-07-24 Thread jincheng sun
Hi all,

Thanks Max and all of your kind words. :)

Sorry for the late reply, as I'm busy working on the Flink 1.9 release. For
the next major release of Flink, we plan to add Python user-defined function
(UDF, UDTF, UDAF) support in Flink. I have gone over the Beam portability
framework and think that it is a great fit for our requirements. However, we
also found some improvements needed in Beam:

Must Have:

1) Currently only BagState is supported in the gRPC protocol and I think we
should support more kinds of state types, such as MapState, ValueState,
ReducingState, CombiningState (AggregatingState in Flink), etc. That's
because these kinds of state will be used in both user-defined functions and
the Flink Python DataStream API.

2) There are warnings that Python 3 is not fully supported in Beam
(beam/sdks/python/setup.py). We should support Python 3.x in the Beam
portability framework, since Python 2 will soon no longer be officially
supported.

3) The configuration "semi_persist_dir" is not set in EnvironmentFactory at
the runner side. The reason I think this is a must-have is that when the
environment type is "PROCESS", the default value "/tmp" may become a big
problem.

4) The buffer size configuration policy should be improved, such as:
   At the runner side, the buffer limit in BeamFnDataBufferingOutboundObserver
is size based. We should also support a time-based limit, especially for the
streaming case.
   At the Python SDK Harness, the buffer size is not configurable in
GrpcDataService, and the input queue of the input buffer in the Python SDK
Harness is not size limited.
   The flush threshold of the output buffer in the Python SDK Harness is 10 MB
by default (_DEFAULT_FLUSH_THRESHOLD=10MB). My suggestion is: make the
threshold configurable and support a time-based threshold.

Nice To Have:
---
1) Improve the interfaces of FnDataService, BundleProcessor, ActiveBundle,
etc., to change the parameter type from WindowedValue<T> to T. (We have
already discussed this in the previous mails.)

2) Refactor the code to avoid pulling in unnecessary dependencies. For
example, beam-sdks-java-core (11 MB) is a package for Java SDK users, and it
is pulled in because a few classes from beam-sdks-java-core are used in
beam-runners-java-fn-execution, such as PipelineOptions (used in
DefaultJobBundleFactory) and FileSystems (used in
BeamFileSystemArtifactRetrievalService).
It means maybe we could add a new module, such as beam-sdks-java-common, to
hold the classes used by both the runner and the SDK.

3) The state cache is not shared between bundles, which is performance
critical for streaming jobs.

4) The coder of WindowedValue<T> cannot be configured, and most of the time we
don't need to serialize and deserialize the timestamp, window and pane
properties in Flink. But currently FullWindowedValueCoder is used by
default in WireCoders.addWireCoder; I suggest making the coder
configurable (i.e. allowing the use of ValueOnlyWindowedValueCoder).

5) Currently, if a coder is not defined in StandardCoders, it will be
wrapped with LengthPrefixedCoder (WireCoders.addWireCoder ->
LengthPrefixUnknownCoders.addLengthPrefixedCoder). However, only a few
coders are defined in StandardCoders. It means that for most coders, a
length will be added to the serialized bytes, which in my view is not
necessary. My suggestion is that maybe we can add some interface or tag for
the coder which indicates whether the coder needs a length prefix or not.

6) Set log level according to PipelineOption in Python SDK Harness.
Currently the log level is set to INFO by default.

7) Allow starting up the StatusServer according to PipelineOptions in the
Python SDK Harness. Currently the StatusServer is started up by default.

Although I put 3) 4) 5) into the "Nice to Have" as they are performance
related, I still think they are very critical for Python UDF execution
performance.
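
To make the overhead in point 4) above concrete, here is a minimal sketch
(assuming the Beam Java SDK classes WindowedValue, VarLongCoder, GlobalWindow
and CoderUtils; the exact encoded sizes will vary with the value and window
types) comparing the encoded size with and without the timestamp/window/pane
metadata:

import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.transforms.windowing.GlobalWindow;
import org.apache.beam.sdk.util.CoderUtils;
import org.apache.beam.sdk.util.WindowedValue;

public class WindowedValueCoderOverhead {
  public static void main(String[] args) throws Exception {
    WindowedValue<Long> element = WindowedValue.valueInGlobalWindow(42L);

    WindowedValue.FullWindowedValueCoder<Long> full =
        WindowedValue.FullWindowedValueCoder.of(VarLongCoder.of(), GlobalWindow.Coder.INSTANCE);
    WindowedValue.ValueOnlyWindowedValueCoder<Long> valueOnly =
        WindowedValue.ValueOnlyWindowedValueCoder.of(VarLongCoder.of());

    // The full coder also serializes the timestamp, window(s) and pane info,
    // which Flink often does not need to ship across the data plane.
    System.out.println("full:       " + CoderUtils.encodeToByteArray(full, element).length + " bytes");
    System.out.println("value only: " + CoderUtils.encodeToByteArray(valueOnly, element).length + " bytes");
  }
}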

Open questions:
-
1) Which coders should be / can be defined in StandardCoders?

Currently we are preparing the design of how to support Python UDF in Flink
based on the Beam portability framework and we will bring up the discussion
in Flink community. We may propose more changes for Beam during that time
and may need more support from Beam community.

To be honest, I'm not an expert of Beam and so please feel free to correct
me if my understanding is wrong. Welcome any feedback.

Best,
Jincheng

Maximilian Michels  于2019年4月25日周四 上午12:14写道:

> Fully agree that this is an effort that goes beyond changing a type
> parameter but I think we have a chance here to cooperate between the two
> projects. I would be happy to help out where I can.
>
> I'm not sure at this point what exactly is feasible for reuse but I
> would imagine the Runner-related code to be useful as well for the
> interaction with the SDK Harness. There are some fundamental differences
> in the model, e.g. how windowing works, which might be challenging to
> work around.
>
> Thanks,
> Max
&g

Re: [ANNOUNCE] New PMC Member: Pablo Estrada

2019-06-11 Thread jincheng sun
Hi Pablo, Congratulations!

Best,
Jincheng

Rakesh Kumar  于2019年6月12日周三 上午6:00写道:

> Congratulations Pablo!!!
>
> On Mon, Jun 10, 2019 at 6:03 AM Gleb Kanterov  wrote:
>
>> Congratulations!
>>
>> On Fri, May 24, 2019 at 9:50 PM Joana Filipa Bernardo Carrasqueira <
>> joanafil...@google.com> wrote:
>>
>>> Congratulations Pablo! Well deserved :D
>>>
>>> On Fri, May 17, 2019 at 3:14 PM Hannah Jiang 
>>> wrote:
>>>
 Congratulations, Pablo, you deserve it!

 *From: *Mark Liu 
 *Date: *Fri, May 17, 2019 at 2:45 PM
 *To: * 

 Congratulations, Pablo!
>
> *From: *Alexey Romanenko 
> *Date: *Fri, May 17, 2019 at 2:12 AM
> *To: *dev
>
> Congratulations, Pablo!
>>
>> On 16 May 2019, at 20:38, Rui Wang  wrote:
>>
>> Congrats! Congrats! Congrats!
>>
>> -Rui
>>
>> On Thu, May 16, 2019 at 9:45 AM Udi Meiri  wrote:
>>
>>> Congrats Pablo!
>>>
>>> On Thu, May 16, 2019 at 9:27 AM Thomas Weise  wrote:
>>>
 Congratulations, Pablo!


 On Thu, May 16, 2019 at 5:03 AM Katarzyna Kucharczyk <
 ka.kucharc...@gmail.com> wrote:

> Wow, great news!  Congratulations, Pablo!
>
> On Thu, May 16, 2019 at 1:28 PM Michał Walenia <
> michal.wale...@polidea.com> wrote:
>
>> Congratulations, Pablo!
>>
>> On Thu, May 16, 2019 at 1:55 AM Rose Nguyen 
>> wrote:
>>
>>> Congrats, Pablo!!
>>>
>>> On Wed, May 15, 2019 at 4:43 PM Heejong Lee 
>>> wrote:
>>>
 Congratulations!

 On Wed, May 15, 2019 at 12:24 PM Niklas Hansson <
 niklas.sven.hans...@gmail.com> wrote:

> Congratulations Pablo :)
>
> Den ons 15 maj 2019 kl 21:21 skrev Ruoyun Huang <
> ruo...@google.com>:
>
>> Congratulations, Pablo!
>>
>> *From: *Charles Chen 
>> *Date: *Wed, May 15, 2019 at 11:04 AM
>> *To: *dev
>>
>> Congrats Pablo and thank you for your contributions!
>>>
>>> On Wed, May 15, 2019, 10:53 AM Valentyn Tymofieiev <
>>> valen...@google.com> wrote:
>>>
 Congrats, Pablo!

 On Wed, May 15, 2019 at 10:41 AM Yifan Zou <
 yifan...@google.com> wrote:

> Congratulations, Pablo!
>
> *From: *Maximilian Michels 
> *Date: *Wed, May 15, 2019 at 2:06 AM
> *To: * 
>
> Congrats Pablo! Thank you for your help to grow the Beam
>> community!
>>
>> On 15.05.19 10:33, Tim Robertson wrote:
>> > Congratulations Pablo
>> >
>> > On Wed, May 15, 2019 at 10:22 AM Ismaël Mejía <
>> ieme...@gmail.com
>> > > wrote:
>> >
>> > Congrats Pablo, well deserved, nice to see your
>> work recognized!
>> >
>> > On Wed, May 15, 2019 at 9:59 AM Pei HE <
>> pei...@gmail.com
>> > > wrote:
>> >  >
>> >  > Congrats, Pablo!
>> >  >
>> >  > On Tue, May 14, 2019 at 11:41 PM Tanay
>> Tummalapalli
>> >  > > ttanay.apa...@gmail.com>> wrote:
>> >  > >
>> >  > > Congratulations Pablo!
>> >  > >
>> >  > > On Wed, May 15, 2019, 12:08 Michael Luckey <
>> adude3...@gmail.com
>> > > wrote:
>> >  > >>
>> >  > >> Congrats, Pablo!
>> >  > >>
>> >  > >> On Wed, May 15, 2019 at 8:21 AM Connell
>> O'Callaghan
>> > mailto:conne...@google.com>>
>> wrote:
>> >  > >>>
>> >  > >>> Awesome well done Pablo!!!
>> >  > >>>
>> >  > >>> Kenn thank you for sharing this great news
>> with us!!!
>> >  > >>>
>> >  > >>> On Tue, May 14, 2019 at 11:01 PM Ahmet Altay
>> > mailto:al...@google.com>> wrote:
>> >  > 
>> >  >  Congratulations!
>> >  > 
>> >  >  On Tue, May 14, 2019 at 9:11 PM Robert Burke
>> > mailto:rob...@frantil.com>>
>> wrote:
>> >  > 

[RESULT] Release flink-shaded 7.0, release candidate

2019-05-30 Thread jincheng sun
Hi all,

I'm happy to announce that we have unanimously approved this release.

There are 6 approving votes, 3 of which are binding:
* Chesnay
* Timo
* Hequn
* Till
* Nico
* Jincheng

There are no disapproving votes.

Thanks, everyone!

Cheers,
Jincheng


Re: [ANNOUNCE] New committer announcement: Yifan Zou

2019-04-25 Thread jincheng sun
Congrats Yifan!

Griselda Cuevas  于2019年4月26日周五 上午1:56写道:

> Congrats Yifan!
>
>
>
>
> On Mon, 22 Apr 2019 at 08:26, Kenneth Knowles  wrote:
>
>> Hi all,
>>
>> Please join me and the rest of the Beam PMC in welcoming a new committer:
>> Yifan Zou.
>>
>> Yifan has been contributing to Beam since early 2018. He has proposed
>> 70+ pull requests, adding dependency checking and improving test
>> infrastructure. But something the numbers cannot show adequately is the
>> huge effort Yifan has put into working with infra and keeping our Jenkins
>> executors healthy.
>>
>> In consideration of Yifan's contributions, the Beam PMC trusts Yifan with
>> the responsibilities of a Beam committer [1].
>>
>> Thank you, Yifan, for your contributions.
>>
>> Kenn
>>
>> [1] https://beam.apache.org/contribute/become-a-committer/#an-apache-beam
>> -committer
>>
>


Re: [DISCUSS] FLIP-38 Support python language in flink TableAPI

2019-04-24 Thread jincheng sun
Hi Robert,

In addition to the questions described by Dian, I also want to know what
difficult problems Py4j's solution will encounter when adding UDF support,
which you mentioned as follows:

Using something like Py4j is an easy way to get up an running, especially
> for a very faithful API, but the instant one wants to add UDFs one hits a
> cliff of sorts (which is surmountable, but likely a lot harder than having
> gone the above approach).


I would appreciate it if you could share more specific cases.

Thanks,
Jincheng

Dian Fu  于2019年4月25日周四 上午11:53写道:

> Thanks everyone for the discussion here.
>
> Regarding to the Java/Scala UDF and the built-in UDF to execute in the
> current Flink way (directly in JVM, not via RPC), I share the same thoughts
> with Max and Robert and I think it will not be a big problem. From the
> design doc, I guess the main reason to take the Py4J way instead of the DAG
> way at present is that DAG has some limitations in some scenarios such as
> interactive programing which may be a strong requirement for data scientist.
>
> > In addition (and I'll admit this is rather subjective) it seems to me
> one of the primary values of a table-like API in a given language (vs. just
> using (say) plain old SQL itself via a console) is the ability to embed it
> in a larger pipeline, or at least drop in operations that are not (as)
> naturally expressed in the "table way," including existing libraries. In
> other words, a full SDK. The Py4j wrapping doesn't extend itself to such
> integration nearly as easily.
>
>
> Hi Robert, regarding to "a larger pipeline", do you mean translating a
> table-like API jobs from/to another kind of API job or embedding third-part
> libraries into a table-like API jobs via UDF? Could you kindly explain why
> this would be a problem for Py4J and will not be a problem if expressing
> the job with DAG?
>
> Thanks,
> Dian
>
>
> > 在 2019年4月25日,上午12:16,Robert Bradshaw  写道:
> >
> > Thanks for the meeting summary, Stephan. Sound like you covered a lot of
> ground. Some more comments below, adding onto what Max has said.
> >
> > On Wed, Apr 24, 2019 at 3:20 PM Maximilian Michels  > wrote:
> > >
> > > Hi Stephan,
> > >
> > > This is excited! Thanks for sharing. The inter-process communication
> > > code looks like the most natural choice as a common ground. To go
> > > further, there are indeed some challenges to solve.
> >
> > It certainly does make sense to share this work, though it does to me
> seem like a rather low level to integrate at.
> >
> > > > => Biggest question is whether the language-independent DAG is
> expressive enough to capture all the expressions that we want to map
> directly to Table API expressions. Currently much is hidden in opaque UDFs.
> Kenn mentioned the structure should be flexible enough to capture more
> expressions transparently.
> > >
> > > Just to add some context how this could be done, there is the concept
> of
> > > a FunctionSpec which is part of a transform in the DAG. FunctionSpec
> > > contains a URN and with a payload. FunctionSpec can be either (1)
> > > translated by the Runner directly, e.g. map to table API concepts or
> (2)
> > > run a user-defined function with an Environment. It could be feasible
> > > for Flink to choose the direct path, whereas Beam Runners would
> leverage
> > > the more generic approach using UDFs. Granted, compatibility across
> > > Flink and Beam would only work if both of the translation paths yielded
> > > the same semantics.
> >
> > To elaborate a bit on this, Beam DAGs are built up by applying
> Transforms (basically operations) to PColections (the equivalent of
> dataset/datastream), but the key point here is that these transforms are
> often composite operations that expand out into smaller subtransforms. This
> expansion happens during pipeline construction, but with the recent work on
> cross language pipelines can happen out of process. This is one point of
> extendability. Secondly, and importantly, this composite structure is
> preserved in the DAG, and so a runner is free to ignore the provided
> expansion and supply its own (so long as semantically it produces exactly
> the same output). These composite operations can be identified by arbitrary
> URNs + payloads, and any runner that does not understand them simply uses
> the pre-provided expansion.
> >
> > The existing Flink runner operates on exactly this principle,
> translating URNs for the leaf operations (Map, Flatten, ...) as well as
> some composites it can do better (e.g. Reshard). It is intentionally easy
> to define and add new ones. This actually seems the easier approach (to me
> at least, but that's probably heavily influenced by what I'm familiar with
> vs. what I'm not).
> >
> > As for how well this maps onto the Flink Tables API, part of that
> depends on how much of the API is the operations themselves, and how much
> is concerning configuration/environment/etc. which is harder to talk about
> in an 

Re: [DISCUSS] Turn `WindowedValue` into `T` in the FnDataService and BeamFnDataClient interface definition

2019-04-24 Thread jincheng sun
Hi Kenn, I think you are right: the Python SDK harness can be shared with
Flink, and we would also need to add some new primitive operations. Regarding
the runner side, I think most of the code in runners/java-fn-execution can be
shared (but needs some improvement, such as FnDataService), while some of it
cannot be shared, such as the job submission code. So we may need to set up a
document to clearly analyze which parts can be shared, which can be shared
but need some changes, and which definitely cannot be shared.

Hi Max, thanks for sharing your opinion. I also prefer using the Beam Fn
services as a library, and I am willing to put more effort into this.
From the view of the current code, abstracting the Fn services into a class
library that other projects can rely on requires a lot of effort from the
Beam community. Turning `WindowedValue<T>` into `T` is just the beginning of
this effort. If the Beam community is willing to abstract the Fn services
into a class library that can be relied upon by other projects, I can try
to draft a document; of course, during this period I may need a lot of help
from you, Kenn, Lukasz, and the Beam community. (I am a new recruit in the
Beam community :-))

What do you think?

Regards,
Jincheng

Kenneth Knowles  于2019年4月24日周三 上午3:32写道:

> It seems to me that the most valuable code to share and keep up with is
> the Python/Go/etc SDK harness; they would need to be enhanced with new
> primitive operations. So you would want to depend directly and share the
> original proto-generated classes too, which Beam publishes as separate
> artifacts for Java. Is the runner-side support code that valuable for
> direct integration into Flink? I would expect once you get past trivial
> wrappers (that you can copy/paste with no loss) you would hit differences
> in architecture so you would diverge anyhow.
>
> Kenn
>
> On Tue, Apr 23, 2019 at 5:32 AM Maximilian Michels  wrote:
>
>> Hi Jincheng,
>>
>> Copying code is a solution for the short term. In the long run I'd like
>> the Fn services to be a library not only for the Beam portability layer
>> but also for other projects which want to leverage it. We should thus
>> make an effort to make it more generic/extensible where necessary and
>> feasible.
>>
>> Since you are investigating reuse of Beam portability in the context of
>> Flink, do you think it would make sense to setup a document where we
>> collect ideas and challenges?
>>
>> Thanks,
>> Max
>>
>> On 23.04.19 13:00, jincheng sun wrote:
>> > Hi Reuven,
>> >
>> > I think you have provided an optional solution for other community
>> which
>> > wants to take advantage of Beam's existing achievements. Thank you very
>> > much!
>> >
>> > I think the Flink community can choose to copy from Beam's code or
>> > choose to rely directly on the beam's class library. The Flink
>> community
>> > also initiated a discussion, more info can be found here
>> > <
>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-38-Support-python-language-in-flink-TableAPI-td28061.html#a28096
>> >
>> >
>> > The purpose of Turns `WindowedValue` into `T` is to promote the
>> > interface design of Beam more versatile, so that other open source
>> > projects have the opportunity to take advantage of Beam's existing
>> > achievements. Of course, just changing the `WindowedValue` into `T`
>> > is not enough to be shared by other projects in the form of a class
>> > library, we need to do more efforts. If Beam can provide a class
>> library
>> > in the future, other community contributors will also have the
>> > willingness to contribute to the beam community. This will benefit both
>> > the community that wants to take advantage of Beam's existing
>> > achievements and the Beam community itself. And thanks to Thomas for
>> > that he has also made a lot of efforts in this regard.
>> >
>> > Thanks again for your valuable suggestion, and welcome any feedback!
>> >
>> > Best,
>> > Jincheng
>> >
>> > Reuven Lax mailto:re...@google.com>> 于2019年4月23日
>> > 周二 上午1:00写道:
>> >
>> > One concern here: these interfaces are intended for use within the
>> > Beam project. Beam may decide to make specific changes to them to
>> > support needed functionality in Beam. If they are being reused by
>> > other projects, then those changes risk breaking those other
>> > projects in unexpected ways. I don't think we can guarantee that we
>> > don't do that. If this is useful in Flink, it would be safer

Re: [DISCUSS] FLIP-38 Support python language in flink TableAPI

2019-04-24 Thread jincheng sun
ructure as
> Flink's Table API that produces the language-independent DAG
>
> *Short-term approach in Flink*
>
>   - Goal is to not block Flink's Python effort on the long term approach
> and the necessary design and evolution of the language-independent DAG.
>   - Depending on what the outcome of above investigation is, Flink may
> initially go with a simple approach to map the Python Table API to the the
> Java Table API via Py4J, as outlined in FLIP-38:
> https://docs.google.com/document/d/1ybYt-0xWRMa1Yf5VsuqGRtOfJBz4p74ZmDxZYg3j_h8
>
>
>
> On Tue, Apr 23, 2019 at 4:14 AM jincheng sun 
> wrote:
>
>> Hi everyone,
>>
>> Thank you for all of your feedback and comments in google doc!
>>
>> I have updated the google doc and add the UDFs part. For a short summary:
>>
>>   - Python TableAPI - Flink introduces a set of Python Table API
>> Interfaces
>> which align with Flink Java Table API. It uses Py4j framework to
>> communicate between Python VM  and Java VM.
>>   - Python User-defined functions - IMO. Flink supports the communication
>> framework of UDFs, we will try to reuse the existing achievements of Beam
>> as much as possible, and do our best for this. The first step is
>>   to solve the above interface definition problem, which turns `
>> WindowedValue` into `T` in the FnDataService and BeamFnDataClient
>> interface definition, has been discussed in the Beam community.
>>
>> The detail can be fonded here:
>>
>> https://docs.google.com/document/d/1ybYt-0xWRMa1Yf5VsuqGRtOfJBz4p74ZmDxZYg3j_h8/edit?usp=sharing
>>
>> So we can start the development of Table API without UDFs in Flink, and
>> work with the Beam community to promote the abstraction of Beam.
>>
>> What do you think?
>>
>> Regards,
>> Jincheng
>>
>> jincheng sun  于2019年4月17日周三 下午4:01写道:
>>
>> > Hi Stephan,
>> >
>> > Thanks for your suggestion and summarize. :)
>> >
>> >  ==> The FLIP should probably reflect the full goal rather than the
>> >> first implementation step only, this would make sure everyone
>> understands
>> >> what the final goal of the effort is.
>> >
>> >
>> > I totally agree that we can implement the function in stages, but FLIP
>> > needs to reflect the full final goal. I agree with Thomas and you,  I
>> will
>> > add the design of the UDF part later.
>> >
>> > Yes, you are right, currently, we only consider the `flink run` and
>> > `python-shell` as the job entry point. and we should add REST API for
>> > another entry point.
>> >
>> > It would be super cool if the Python API would work seamlessly with all
>> >> modes of starting Flink jobs.
>> >
>> >
>> > If my understand you correctly, support Python TableAPI in Kubernetes,
>> we
>> > only need to increase (or improve the existing) REST API corresponding
>> to
>> > the Python Table API, of course, it also may need to release Docker
>> Image
>> > that supports Python, it will easily deploy Python TableAPI into
>> > Kubernetes.
>> >
>> > So, Finally, we support the following ways to submit Python TableAPI:
>> > - Python Shell - interactive development.
>> > - CLI - submit the job by `flink run`. e.g: deploy job into the yarn
>> > cluster.
>> > - REST - submit the job by REST API. e.g: deploy job into the kubernetes
>> > cluster.
>> >
>> > Please correct me if there are any incorrect understanding.
>> >
>> > Thanks,
>> > Jincheng
>> >
>> >
>> > Stephan Ewen  于2019年4月12日周五 上午12:22写道:
>> >
>> >> One more thought:
>> >>
>> >> The FLIP is very much centered on the CLI and it looks like it has
>> mainly
>> >> batch jobs and session clusters in mind.
>> >>
>> >> In very many cases, especially in streaming cases, the CLI (or shell)
>> is
>> >> not the entry point for a program.
>> >> See for example the use of Flink jobs on Kubernetes (Container Mode /
>> >> Entrypoint).
>> >>
>> >> It would be super cool if the Python API would work seamlessly with all
>> >> modes of starting Flink jobs.
>> >> That would make i available to all users.
>> >>
>> >> On Thu, Apr 11, 2019 at 5:34 PM Stephan Ewen  wrote:
>> >>
>> >> > Hi all!
>> >> >
>> >> > I think that all the opinions and ideas are not actually in
&

Re: [DISCUSS] Turn `WindowedValue` into `T` in the FnDataService and BeamFnDataClient interface definition

2019-04-23 Thread jincheng sun
Hi Reuven,

I think you have provided an alternative for other communities that want to
take advantage of Beam's existing achievements. Thank you very much!

I think the Flink community can choose to copy from Beam's code or choose
to rely directly on Beam's class library. The Flink community has also
initiated a discussion; more info can be found here
<http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-38-Support-python-language-in-flink-TableAPI-td28061.html#a28096>

The purpose of turning `WindowedValue<T>` into `T` is to make the
interface design of Beam more versatile, so that other open source projects
have the opportunity to take advantage of Beam's existing achievements. Of
course, just changing `WindowedValue<T>` into `T` is not enough for it to be
shared with other projects in the form of a class library; we need to do
more. If Beam can provide such a class library in the future, contributors
from other communities will also be willing to contribute to the Beam
community. This will benefit both the communities that want to take
advantage of Beam's existing achievements and the Beam community itself.
Thanks also to Thomas, who has made a lot of effort in this regard.

Thanks again for your valuable suggestion, and welcome any feedback!

Best,
Jincheng

Reuven Lax  于2019年4月23日周二 上午1:00写道:

> One concern here: these interfaces are intended for use within the Beam
> project. Beam may decide to make specific changes to them to support needed
> functionality in Beam. If they are being reused by other projects, then
> those changes risk breaking those other projects in unexpected ways. I
> don't think we can guarantee that we don't do that. If this is useful in
> Flink, it would be safer to copy the code IMO rather than to directly
> depend on it.
>
> On Mon, Apr 22, 2019 at 12:08 AM jincheng sun 
> wrote:
>
>> Hi Kenn,
>>
>> Thanks for your reply, and explained the design of WindowValue clearly!
>>
>> At present, the definitions of `FnDataService` and `BeamFnDataClient` in
>> Data Plane are very clear and universal, such as: send(...)/receive(...).
>> If it is only applied in the project of Beam, it is already very good.
>> Because `WindowValue` is a very basic data structure in the Beam project,
>> both the Runner and the SDK harness have define the WindowedValue data
>> structure.
>>
>> The reason I want to change the interface parameter from
>> `WindowedValue` to T is because I want to make the `Data Plane`
>> interface into a class library that can be used by other projects (such as
>> Apache Flink), so that other projects Can have its own `FnDataService`
>> implementation. However, the definition of `WindowedValue` does not apply
>> to all projects. For example, Apache Flink also has a definition similar to
>> WindowedValue. For example, Apache Flink Stream has StreamRecord. If we
>> change `WindowedValue` to T, then other project's implementation does
>> not need to wrap WindowedValue, the interface will become more concise.
>> Furthermore,  we only need one T, such as the Apache Flink DataSet operator.
>>
>> So, I agree with your understanding, I don't expect `WindowedValueXXX`
>> in the FnDataService interface, I hope to just use a `T`.
>>
>> Have you seen some problem if we change the interface parameter from
>> `WindowedValue` to T?
>>
>> Thanks,
>> Jincheng
>>
>> Kenneth Knowles  于2019年4月20日周六 上午2:38写道:
>>
>>> WindowedValue has always been an interface, not a concrete
>>> representation:
>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/WindowedValue.java
>>> <https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/WindowedValue.java#L52>.
>>> It is an abstract class because we started in Java 7 where you could not
>>> have default methods, and just due to legacy style concerns. it is not just
>>> discussed, but implemented, that there are WindowedValue implementations
>>> with fewer allocations.
>>> At the coder level, it was also always intended to have multiple
>>> encodings. We already do have separate encodings based on whether there is
>>> 1 window or multiple windows. The coder for a particular kind of
>>> WindowedValue should decide this. Before the Fn API none of this had to be
>>> standardized, because the runner could just choose whatever it wants. Now
>>> we have to standardize any encodings that runners and harnesses both need
>>> to know. There should be many, and adding more should be just a matter of
>>> standardization, no new design.
>>>
>>> N

Re: [DISCUSS] Turn `WindowedValue` into `T` in the FnDataService and BeamFnDataClient interface definition

2019-04-22 Thread jincheng sun
Hi Kenn,

Thanks for your reply and for explaining the design of WindowedValue clearly!

At present, the definitions of `FnDataService` and `BeamFnDataClient` in the
Data Plane are very clear and universal, such as send(...)/receive(...).
If they are only used within the Beam project, they are already very good,
because `WindowedValue` is a very basic data structure in the Beam project
and both the Runner and the SDK harness define the WindowedValue data
structure.

The reason I want to change the interface parameter from `WindowedValue<T>`
to T is that I want to make the `Data Plane` interface into a class
library that can be used by other projects (such as Apache Flink), so that
other projects can have their own `FnDataService` implementations. However,
the definition of `WindowedValue` does not apply to all projects. For
example, Apache Flink also has a definition similar to WindowedValue: the
Flink streaming runtime has StreamRecord. If we change `WindowedValue<T>` to
T, then other projects' implementations do not need to wrap their elements in
WindowedValue, and the interface becomes more concise. Furthermore, sometimes
only a plain T is needed, as in the Apache Flink DataSet operators.

So, I agree with your understanding: I don't expect a `WindowedValueXXX`
in the FnDataService interface; I hope to just use a `T`.

Do you see any problem if we change the interface parameter from
`WindowedValue<T>` to `T`?

Thanks,
Jincheng

Kenneth Knowles  于2019年4月20日周六 上午2:38写道:

> WindowedValue has always been an interface, not a concrete representation:
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/WindowedValue.java
> <https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/WindowedValue.java#L52>.
> It is an abstract class because we started in Java 7 where you could not
> have default methods, and just due to legacy style concerns. it is not just
> discussed, but implemented, that there are WindowedValue implementations
> with fewer allocations.
> At the coder level, it was also always intended to have multiple
> encodings. We already do have separate encodings based on whether there is
> 1 window or multiple windows. The coder for a particular kind of
> WindowedValue should decide this. Before the Fn API none of this had to be
> standardized, because the runner could just choose whatever it wants. Now
> we have to standardize any encodings that runners and harnesses both need
> to know. There should be many, and adding more should be just a matter of
> standardization, no new design.
>
> None of this should be user-facing or in the runner API / pipeline graph -
> that is critical to making it flexible on the backend between the runner &
> SDK harness.
>
> If I understand it, from our offline discussion, you are interested in the
> case where you issue a ProcessBundleRequest to the SDK harness and none of
> the primitives in the subgraph will ever observe the metadata. So you want
> to not even have a tiny
> "WindowedValueWithNoMetadata". Is that accurate?
>
> Kenn
>
> On Fri, Apr 19, 2019 at 10:17 AM jincheng sun 
> wrote:
>
>> Thank you! And have a nice weekend!
>>
>>
>> Lukasz Cwik  于2019年4月20日周六 上午1:14写道:
>>
>>> I have added you as a contributor.
>>>
>>> On Fri, Apr 19, 2019 at 9:56 AM jincheng sun 
>>> wrote:
>>>
>>>> Hi Lukasz,
>>>>
>>>> Thanks for your affirmation and provide more contextual information. :)
>>>>
>>>> Would you please give me the contributor permission?  My JIRA ID is
>>>> sunjincheng121.
>>>>
>>>> I would like to create/assign tickets for this work.
>>>>
>>>> Thanks,
>>>> Jincheng
>>>>
>>>> Lukasz Cwik  于2019年4月20日周六 上午12:26写道:
>>>>
>>>>> Since I don't think this is a contentious change.
>>>>>
>>>>> On Fri, Apr 19, 2019 at 9:25 AM Lukasz Cwik  wrote:
>>>>>
>>>>>> Yes, using T makes sense.
>>>>>>
>>>>>> The WindowedValue was meant to be a context object in the SDK harness
>>>>>> that propagates various information about the current element. We have
>>>>>> discussed in the past about:
>>>>>> * making optimizations which would pass around less of the context
>>>>>> information if we know that the DoFns don't need it (for example, all the
>>>>>> values share the same window).
>>>>>> * versioning the encoding separately from the WindowedValue context
>>>>>> object (see recent discussion about element timestamp precision [1])
>>>>>> * the ru

Re: [DISCUSS] Turn `WindowedValue` into `T` in the FnDataService and BeamFnDataClient interface definition

2019-04-19 Thread jincheng sun
Thank you! And have a nice weekend!


Lukasz Cwik  于2019年4月20日周六 上午1:14写道:

> I have added you as a contributor.
>
> On Fri, Apr 19, 2019 at 9:56 AM jincheng sun 
> wrote:
>
>> Hi Lukasz,
>>
>> Thanks for your affirmation and provide more contextual information. :)
>>
>> Would you please give me the contributor permission?  My JIRA ID is
>> sunjincheng121.
>>
>> I would like to create/assign tickets for this work.
>>
>> Thanks,
>> Jincheng
>>
>> Lukasz Cwik  于2019年4月20日周六 上午12:26写道:
>>
>>> Since I don't think this is a contentious change.
>>>
>>> On Fri, Apr 19, 2019 at 9:25 AM Lukasz Cwik  wrote:
>>>
>>>> Yes, using T makes sense.
>>>>
>>>> The WindowedValue was meant to be a context object in the SDK harness
>>>> that propagates various information about the current element. We have
>>>> discussed in the past about:
>>>> * making optimizations which would pass around less of the context
>>>> information if we know that the DoFns don't need it (for example, all the
>>>> values share the same window).
>>>> * versioning the encoding separately from the WindowedValue context
>>>> object (see recent discussion about element timestamp precision [1])
>>>> * the runner may want its own representation of a context object that
>>>> makes sense for it which isn't a WindowedValue necessarily.
>>>>
>>>> Feel free to cut a JIRA about this and start working on a change
>>>> towards this.
>>>>
>>>> 1:
>>>> https://lists.apache.org/thread.html/221b06e81bba335d0ea8d770212cc7ee047dba65bec7978368a51473@%3Cdev.beam.apache.org%3E
>>>>
>>>> On Fri, Apr 19, 2019 at 3:18 AM jincheng sun 
>>>> wrote:
>>>>
>>>>> Hi Beam devs,
>>>>>
>>>>> I read some of the docs about `Communicating over the Fn API` in Beam.
>>>>> I feel that Beam has a very good design for Control Plane/Data Plane/State
>>>>> Plane/Logging services, and it is described in >>>> data> document. When communicating between Runner and SDK Harness, the
>>>>> DataPlane API will be WindowedValue(An immutable triple of value,
>>>>> timestamp, and windows.) As a contract object between Runner and SDK
>>>>> Harness. I see the interface definitions for sending and receiving data in
>>>>> the code as follows:
>>>>>
>>>>> - org.apache.beam.runners.fnexecution.data.FnDataService
>>>>>
>>>>> public interface FnDataService {
>>>>>> <T> InboundDataClient receive(LogicalEndpoint inputLocation,
>>>>>> Coder<WindowedValue<T>> coder, FnDataReceiver<WindowedValue<T>> listener);
>>>>>> <T> CloseableFnDataReceiver<WindowedValue<T>> send(
>>>>>>   LogicalEndpoint outputLocation, Coder<WindowedValue<T>> coder);
>>>>>> }
>>>>>
>>>>>
>>>>>
>>>>> - org.apache.beam.fn.harness.data.BeamFnDataClient
>>>>>
>>>>> public interface BeamFnDataClient {
>>>>>> <T> InboundDataClient receive(ApiServiceDescriptor apiServiceDescriptor,
>>>>>> LogicalEndpoint inputLocation, Coder<WindowedValue<T>> coder,
>>>>>> FnDataReceiver<WindowedValue<T>> receiver);
>>>>>> <T> CloseableFnDataReceiver<WindowedValue<T>> send(
>>>>>> Endpoints.ApiServiceDescriptor apiServiceDescriptor, LogicalEndpoint outputLocation,
>>>>>> Coder<WindowedValue<T>> coder);
>>>>>> }
>>>>>
>>>>>
>>>>> Both `Coder>` and `FnDataReceiver>`
>>>>> use `WindowedValue` as the data structure that both sides of Runner and 
>>>>> SDK
>>>>> Harness know each other. Control Plane/Data Plane/State Plane/Logging is a
>>>>> highly abstraction, such as Control Plane and Logging, these are common
>>>>> requirements for all multi-language platforms. For example, the Flink
>>>>> community is also discussing how to support Python UDF, as well as how to
>>>>> deal with docker environment. how to data transfer, how to state access,
>>>>> how to logging etc. If Beam can further abstract these service interfaces,
>>>>> i.e., interface definitions are compatible with multiple engines, and
>>>>> finally provided to other projects in the form of class libraries, it
>>>>> definitely will help other platforms that want to support multiple
>>>>> languages. So could beam can further abstract the interface definition of
>>>>> FnDataService's BeamFnDataClient? Here I am to throw out a minnow to catch
>>>>> a whale, take the FnDataService#receive interface as an example, and turn
>>>>> `WindowedValue` into `T` so that other platforms can be extended
>>>>> arbitrarily, as follows:
>>>>>
>>>>>  <T> InboundDataClient receive(LogicalEndpoint inputLocation, Coder<T>
>>>>> coder, FnDataReceiver<T> listener);
>>>>>
>>>>> What do you think?
>>>>>
>>>>> Feel free to correct me if there any incorrect understanding. And
>>>>> welcome any feedback!
>>>>>
>>>>>
>>>>> Regards,
>>>>> Jincheng
>>>>>
>>>>


[DISCUSS] Turn `WindowedValue` into `T` in the FnDataService and BeamFnDataClient interface definition

2019-04-19 Thread jincheng sun
Hi Beam devs,

I read some of the docs about `Communicating over the Fn API` in Beam. I
feel that Beam has a very good design for the Control Plane/Data Plane/State
Plane/Logging services, and it is described in the documentation. When
communicating between the Runner and the SDK Harness, the Data Plane API uses
WindowedValue<T> (an immutable triple of value, timestamp, and windows) as
the contract object between the Runner and the SDK Harness. I see the
interface definitions for sending and receiving data in the code as follows:

- org.apache.beam.runners.fnexecution.data.FnDataService

public interface FnDataService {
>   <T> InboundDataClient receive(LogicalEndpoint inputLocation,
>       Coder<WindowedValue<T>> coder, FnDataReceiver<WindowedValue<T>> listener);
>   <T> CloseableFnDataReceiver<WindowedValue<T>> send(
>       LogicalEndpoint outputLocation, Coder<WindowedValue<T>> coder);
> }



- org.apache.beam.fn.harness.data.BeamFnDataClient

public interface BeamFnDataClient {
>   <T> InboundDataClient receive(ApiServiceDescriptor apiServiceDescriptor,
>       LogicalEndpoint inputLocation, Coder<WindowedValue<T>> coder,
>       FnDataReceiver<WindowedValue<T>> receiver);
>   <T> CloseableFnDataReceiver<WindowedValue<T>> send(
>       Endpoints.ApiServiceDescriptor apiServiceDescriptor,
>       LogicalEndpoint outputLocation, Coder<WindowedValue<T>> coder);
> }


Both `Coder<WindowedValue<T>>` and `FnDataReceiver<WindowedValue<T>>` use
`WindowedValue<T>` as the data structure that both sides, the Runner and the
SDK Harness, know. Control Plane/Data Plane/State Plane/Logging is a
high-level abstraction; things like the Control Plane and Logging are common
requirements for all multi-language platforms. For example, the Flink
community is also discussing how to support Python UDFs, as well as how to
deal with the docker environment, how to transfer data, how to access state,
how to log, etc. If Beam can further abstract these service interfaces,
i.e., make the interface definitions compatible with multiple engines, and
finally provide them to other projects in the form of class libraries, it
will definitely help other platforms that want to support multiple
languages. So could Beam further abstract the interface definitions of
FnDataService and BeamFnDataClient? Here I am throwing out a minnow to catch
a whale: take the FnDataService#receive interface as an example, and turn
`WindowedValue<T>` into `T` so that other platforms can extend it
arbitrarily, as follows:

 <T> InboundDataClient receive(LogicalEndpoint inputLocation, Coder<T> coder,
FnDataReceiver<T> listener);
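
As a self-contained sketch of the effect of this change (all classes below
are simplified stand-ins, not the real Beam or Flink types): with a plain T,
a runner can move its own element type, e.g. something shaped like Flink's
StreamRecord, across the data plane without an extra WindowedValue layer.

// Simplified stand-ins, not the real Beam or Flink classes.
class StreamRecord<T> { // shaped like Flink's StreamRecord
  final T value;
  StreamRecord(T value) { this.value = value; }
}

interface FnDataReceiver<T> {
  void accept(T element);
}

class GenericReceiveSketch {
  // Generified "receive": any element type T, no WindowedValue wrapper required.
  static <T> void receive(T element, FnDataReceiver<T> listener) {
    listener.accept(element);
  }

  public static void main(String[] args) {
    // A runner such as Flink could register a receiver for its own element type directly.
    receive(new StreamRecord<>("hello"), r -> System.out.println(r.value));
  }
}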

What do you think?

Feel free to correct me if there is any incorrect understanding, and welcome
any feedback!


Regards,
Jincheng