[NEED HELP] PMC only finalization items for release 2.30.0

2021-06-09 Thread Heejong Lee
Hi,

I'm finishing the 2.30.0 release and need help with the PMC-only finalization
items in the release guide (
https://beam.apache.org/contribute/release-guide/#10-finalize-the-release).
Please let me know if any PMC members have some time to do these tasks :)

Thanks!


Re: [VOTE] Release 2.30.0, release candidate #1

2021-06-08 Thread Heejong Lee
I'm happy to announce that we have unanimously approved this release.

There are 5 approving votes:

Binding:
* Chamikara Jayalath
* Ahmet Altay
* Robert Bradshaw
* Kenneth Knowles

Non-binding:
* Tomo Suzuki

There are no disapproving votes. I will finalize the release.

Thanks everyone!

On Tue, Jun 8, 2021 at 12:45 PM Kenneth Knowles  wrote:

> +1 (binding)
>
> Verified wordcount with various configuration parameters and they all
> worked. Particularly confirming that all the containers are chosen
> correctly.
>
> Kenn
>
> On Tue, Jun 8, 2021 at 12:40 PM Heejong Lee  wrote:
>
>>
>>
>> On Tue, Jun 8, 2021 at 12:06 PM Kenneth Knowles  wrote:
>>
>>> I have some configurations that did not work properly in 2.29.0 that I'd
>>> like to verify against this RC. Sorry I haven't had a chance to verify yet,
>>> but can you please wait for that? I am doing the verification right now.
>>>
>>
>> Sure. Please let me know when you finish the validation.
>>
>>
>>>
>>> Kenn
>>>
>>> On Mon, Jun 7, 2021 at 4:28 PM Heejong Lee  wrote:
>>>
>>>> FYI, we now have three binding votes and I will close the vote tomorrow
>>>> morning.
>>>>
>>>> The RC build has been validated for most quickstart and mobile gaming
>>>> examples, except the Flink / Spark runners on YARN and standalone clusters
>>>> (more details in the spreadsheet [9] from the original announcement).
>>>>
>>>> On Thu, Jun 3, 2021 at 6:12 PM Robert Bradshaw 
>>>> wrote:
>>>>
>>>>> +1 (binding)
>>>>>
>>>>> Verified the signatures are all good and the source tarball matches
>>>>> github.
>>>>>
>>>>> On Thu, Jun 3, 2021 at 3:38 PM Ahmet Altay  wrote:
>>>>> >
>>>>> > +1 (binding) - I ran python quickstart examples on the direct runner.
>>>>> >
>>>>> > Thank you for preparing the RC!
>>>>> >
>>>>> > Ahmet
>>>>> >
>>>>> > On Thu, Jun 3, 2021 at 2:58 PM Chamikara Jayalath <
>>>>> chamik...@google.com> wrote:
>>>>> >>
>>>>> >> +1 (binding)
>>>>> >>
>>>>> >> Tested some Java quickstart validations and multi-language
>>>>> pipelines.
>>>>> >>
>>>>> >> Thanks,
>>>>> >> Cham
>>>>> >>
>>>>> >> On Thu, Jun 3, 2021 at 2:03 PM Tomo Suzuki 
>>>>> wrote:
>>>>> >>>
>>>>> >>> +1 (non-binding)
>>>>> >>>
>>>>> >>> Thank you for the preparation. With the GCP dependencies of
>>>>> interest to me, the GitHub checks passed.
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>> On Thu, Jun 3, 2021 at 4:55 AM Heejong Lee 
>>>>> wrote:
>>>>> >>>>
>>>>> >>>> Hi everyone,
>>>>> >>>>
>>>>> >>>> Please review and vote on the release candidate #1 for the
>>>>> version 2.30.0, as follows:
>>>>> >>>> [ ] +1, Approve the release
>>>>> >>>> [ ] -1, Do not approve the release (please provide specific
>>>>> comments)
>>>>> >>>>
>>>>> >>>> Reviewers are encouraged to test their own use cases with the
>>>>> release candidate, and vote +1 if no issues are found.
>>>>> >>>>
>>>>> >>>> The complete staging area is available for your review, which
>>>>> includes:
>>>>> >>>> * JIRA release notes [1],
>>>>> >>>> * the official Apache source release to be deployed to
>>>>> dist.apache.org [2], which is signed with the key with fingerprint
>>>>> DBC03F1CCF4240FBD0F256F054550BE0F4C0A24D [3],
>>>>> >>>> * all artifacts to be deployed to the Maven Central Repository
>>>>> [4],
>>>>> >>>> * source code tag "v2.30.0-RC1" [5],
>>>>> >>>> * website pull request listing the release [6], publishing the
>>>>> API reference manual [7], and the blog post [8].
>>>>> >>>> * Java artifacts were built with Maven 3.6.3 and OpenJDK
>>>>> 1.8.0_292.
>>>>> >>>> * Python artifacts are deployed along with the source release to
>>>>> the dist.apache.org [2].
>>>>> >>>> * Validation sheet with a tab for 2.30.0 release to help with
>>>>> validation [9].
>>>>> >>>> * Docker images published to Docker Hub [10].
>>>>> >>>> * Python artifacts are published to pypi as a pre-release version
>>>>> [11].
>>>>> >>>>
>>>>> >>>> The vote will be open for at least 72 hours. It is adopted by
>>>>> majority approval, with at least 3 PMC affirmative votes.
>>>>> >>>>
>>>>> >>>> Thanks,
>>>>> >>>> Heejong
>>>>> >>>>
>>>>> >>>> [1]
>>>>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12349978
>>>>> >>>> [2] https://dist.apache.org/repos/dist/dev/beam/2.30.0/
>>>>> >>>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>>>> >>>> [4]
>>>>> https://repository.apache.org/content/repositories/orgapachebeam-1174/
>>>>> >>>> [5] https://github.com/apache/beam/tree/v2.30.0-RC1
>>>>> >>>> [6] https://github.com/apache/beam/pull/14894
>>>>> >>>> [7] https://github.com/apache/beam-site/pull/613
>>>>> >>>> [8] https://github.com/apache/beam/pull/14895
>>>>> >>>> [9]
>>>>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=109662250
>>>>> >>>> [10] https://hub.docker.com/search?q=apache%2Fbeam&type=image
>>>>> >>>> [11] https://pypi.org/project/apache-beam/2.30.0rc1/
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>> --
>>>>> >>> Regards,
>>>>> >>> Tomo
>>>>>
>>>>


Re: [VOTE] Release 2.30.0, release candidate #1

2021-06-08 Thread Heejong Lee
On Tue, Jun 8, 2021 at 12:06 PM Kenneth Knowles  wrote:

> I have some configurations that did not work properly in 2.29.0 that I'd
> like to verify against this RC. Sorry I haven't had a chance to verify yet,
> but can you please wait for that? I am doing the verification right now.
>

Sure. Please let me know when you finish the validation.


>
> Kenn
>
> On Mon, Jun 7, 2021 at 4:28 PM Heejong Lee  wrote:
>
>> FYI, we now have three binding votes and I will close the vote tomorrow
>> morning.
>>
>> The RC build has been validated for most quickstart and mobile gaming
>> examples, except the Flink / Spark runners on YARN and standalone clusters
>> (more details in the spreadsheet [9] from the original announcement).
>>
>> On Thu, Jun 3, 2021 at 6:12 PM Robert Bradshaw 
>> wrote:
>>
>>> +1 (binding)
>>>
>>> Verified the signatures are all good and the source tarball matches
>>> github.
>>>
>>> On Thu, Jun 3, 2021 at 3:38 PM Ahmet Altay  wrote:
>>> >
>>> > +1 (binding) - I ran python quickstart examples on the direct runner.
>>> >
>>> > Thank you for preparing the RC!
>>> >
>>> > Ahmet
>>> >
>>> > On Thu, Jun 3, 2021 at 2:58 PM Chamikara Jayalath <
>>> chamik...@google.com> wrote:
>>> >>
>>> >> +1 (binding)
>>> >>
>>> >> Tested some Java quickstart validations and multi-language pipelines.
>>> >>
>>> >> Thanks,
>>> >> Cham
>>> >>
>>> >> On Thu, Jun 3, 2021 at 2:03 PM Tomo Suzuki 
>>> wrote:
>>> >>>
>>> >>> +1 (non-binding)
>>> >>>
>>> >>> Thank you for the preparation. With the GCP dependencies of
>>> interest to me, the GitHub checks passed.
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Thu, Jun 3, 2021 at 4:55 AM Heejong Lee 
>>> wrote:
>>> >>>>
>>> >>>> Hi everyone,
>>> >>>>
>>> >>>> Please review and vote on the release candidate #1 for the version
>>> 2.30.0, as follows:
>>> >>>> [ ] +1, Approve the release
>>> >>>> [ ] -1, Do not approve the release (please provide specific
>>> comments)
>>> >>>>
>>> >>>> Reviewers are encouraged to test their own use cases with the
>>> release candidate, and vote +1 if no issues are found.
>>> >>>>
>>> >>>> The complete staging area is available for your review, which
>>> includes:
>>> >>>> * JIRA release notes [1],
>>> >>>> * the official Apache source release to be deployed to
>>> dist.apache.org [2], which is signed with the key with fingerprint
>>> DBC03F1CCF4240FBD0F256F054550BE0F4C0A24D [3],
>>> >>>> * all artifacts to be deployed to the Maven Central Repository [4],
>>> >>>> * source code tag "v2.30.0-RC1" [5],
>>> >>>> * website pull request listing the release [6], publishing the API
>>> reference manual [7], and the blog post [8].
>>> >>>> * Java artifacts were built with Maven 3.6.3 and OpenJDK 1.8.0_292.
>>> >>>> * Python artifacts are deployed along with the source release to
>>> the dist.apache.org [2].
>>> >>>> * Validation sheet with a tab for 2.30.0 release to help with
>>> validation [9].
>>> >>>> * Docker images published to Docker Hub [10].
>>> >>>> * Python artifacts are published to pypi as a pre-release version
>>> [11].
>>> >>>>
>>> >>>> The vote will be open for at least 72 hours. It is adopted by
>>> majority approval, with at least 3 PMC affirmative votes.
>>> >>>>
>>> >>>> Thanks,
>>> >>>> Heejong
>>> >>>>
>>> >>>> [1]
>>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12349978
>>> >>>> [2] https://dist.apache.org/repos/dist/dev/beam/2.30.0/
>>> >>>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>> >>>> [4]
>>> https://repository.apache.org/content/repositories/orgapachebeam-1174/
>>> >>>> [5] https://github.com/apache/beam/tree/v2.30.0-RC1
>>> >>>> [6] https://github.com/apache/beam/pull/14894
>>> >>>> [7] https://github.com/apache/beam-site/pull/613
>>> >>>> [8] https://github.com/apache/beam/pull/14895
>>> >>>> [9]
>>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=109662250
>>> >>>> [10] https://hub.docker.com/search?q=apache%2Fbeam&type=image
>>> >>>> [11] https://pypi.org/project/apache-beam/2.30.0rc1/
>>> >>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> Regards,
>>> >>> Tomo
>>>
>>


Re: [VOTE] Release 2.30.0, release candidate #1

2021-06-07 Thread Heejong Lee
FYI, we now have three binding votes and I will close the vote tomorrow
morning.

The RC build has been validated for most quickstart and mobile gaming
examples, except the Flink / Spark runners on YARN and standalone clusters
(more details in the spreadsheet [9] from the original announcement).

On Thu, Jun 3, 2021 at 6:12 PM Robert Bradshaw  wrote:

> +1 (binding)
>
> Verified the signatures are all good and the source tarball matches github.
>
> On Thu, Jun 3, 2021 at 3:38 PM Ahmet Altay  wrote:
> >
> > +1 (binding) - I ran python quickstart examples on the direct runner.
> >
> > Thank you for preparing the RC!
> >
> > Ahmet
> >
> > On Thu, Jun 3, 2021 at 2:58 PM Chamikara Jayalath 
> wrote:
> >>
> >> +1 (binding)
> >>
> >> Tested some Java quickstart validations and multi-language pipelines.
> >>
> >> Thanks,
> >> Cham
> >>
> >> On Thu, Jun 3, 2021 at 2:03 PM Tomo Suzuki  wrote:
> >>>
> >>> +1 (non-binding)
> >>>
> >>> Thank you for the preparation. With the GCP dependencies of
> interest to me, the GitHub checks passed.
> >>>
> >>>
> >>>
> >>> On Thu, Jun 3, 2021 at 4:55 AM Heejong Lee  wrote:
> >>>>
> >>>> Hi everyone,
> >>>>
> >>>> Please review and vote on the release candidate #1 for the version
> 2.30.0, as follows:
> >>>> [ ] +1, Approve the release
> >>>> [ ] -1, Do not approve the release (please provide specific comments)
> >>>>
> >>>> Reviewers are encouraged to test their own use cases with the release
> candidate, and vote +1 if no issues are found.
> >>>>
> >>>> The complete staging area is available for your review, which
> includes:
> >>>> * JIRA release notes [1],
> >>>> * the official Apache source release to be deployed to
> dist.apache.org [2], which is signed with the key with fingerprint
> DBC03F1CCF4240FBD0F256F054550BE0F4C0A24D [3],
> >>>> * all artifacts to be deployed to the Maven Central Repository [4],
> >>>> * source code tag "v2.30.0-RC1" [5],
> >>>> * website pull request listing the release [6], publishing the API
> reference manual [7], and the blog post [8].
> >>>> * Java artifacts were built with Maven 3.6.3 and OpenJDK 1.8.0_292.
> >>>> * Python artifacts are deployed along with the source release to the
> dist.apache.org [2].
> >>>> * Validation sheet with a tab for 2.30.0 release to help with
> validation [9].
> >>>> * Docker images published to Docker Hub [10].
> >>>> * Python artifacts are published to pypi as a pre-release version
> [11].
> >>>>
> >>>> The vote will be open for at least 72 hours. It is adopted by
> majority approval, with at least 3 PMC affirmative votes.
> >>>>
> >>>> Thanks,
> >>>> Heejong
> >>>>
> >>>> [1]
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12349978
> >>>> [2] https://dist.apache.org/repos/dist/dev/beam/2.30.0/
> >>>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> >>>> [4]
> https://repository.apache.org/content/repositories/orgapachebeam-1174/
> >>>> [5] https://github.com/apache/beam/tree/v2.30.0-RC1
> >>>> [6] https://github.com/apache/beam/pull/14894
> >>>> [7] https://github.com/apache/beam-site/pull/613
> >>>> [8] https://github.com/apache/beam/pull/14895
> >>>> [9]
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=109662250
> >>>> [10] https://hub.docker.com/search?q=apache%2Fbeam&type=image
> >>>> [11] https://pypi.org/project/apache-beam/2.30.0rc1/
> >>>
> >>>
> >>>
> >>> --
> >>> Regards,
> >>> Tomo
>


[VOTE] Release 2.30.0, release candidate #1

2021-06-03 Thread Heejong Lee
Hi everyone,

Please review and vote on the release candidate #1 for the version 2.30.0,
as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

Reviewers are encouraged to test their own use cases with the release
candidate, and vote +1 if no issues are found.

The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release to be deployed to dist.apache.org [2],
which is signed with the key with fingerprint
DBC03F1CCF4240FBD0F256F054550BE0F4C0A24D [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "v2.30.0-RC1" [5],
* website pull request listing the release [6], publishing the API
reference manual [7], and the blog post [8].
* Java artifacts were built with Maven 3.6.3 and OpenJDK 1.8.0_292.
* Python artifacts are deployed along with the source release to the
dist.apache.org [2].
* Validation sheet with a tab for 2.30.0 release to help with validation
[9].
* Docker images published to Docker Hub [10].
* Python artifacts are published to pypi as a pre-release version [11].

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.

Thanks,
Heejong

[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12349978
[2] https://dist.apache.org/repos/dist/dev/beam/2.30.0/
[3] https://dist.apache.org/repos/dist/release/beam/KEYS
[4] https://repository.apache.org/content/repositories/orgapachebeam-1174/
[5] https://github.com/apache/beam/tree/v2.30.0-RC1
[6] https://github.com/apache/beam/pull/14894
[7] https://github.com/apache/beam-site/pull/613
[8] https://github.com/apache/beam/pull/14895
[9]
https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=109662250
[10] https://hub.docker.com/search?q=apache%2Fbeam&type=image
[11] https://pypi.org/project/apache-beam/2.30.0rc1/
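
For reviewers who want a concrete starting point, here is a minimal sketch of
the kind of Python quickstart validation mentioned in the votes above. It is
illustrative only: it assumes the RC has already been installed from PyPI
(e.g. apache-beam==2.30.0rc1 [11]) and that the sample input path is reachable
from your environment.

```
# A wordcount-style RC check on the DirectRunner (a sketch, not the official
# quickstart). The input/output paths and options are illustrative.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(['--runner=DirectRunner'])
with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromText(
         'gs://apache-beam-samples/shakespeare/kinglear.txt')
     | 'Split' >> beam.FlatMap(str.split)
     | 'Count' >> beam.combiners.Count.PerElement()
     | 'Format' >> beam.MapTuple(lambda word, count: '%s: %d' % (word, count))
     | 'Write' >> beam.io.WriteToText('/tmp/wordcount',
                                      file_name_suffix='.txt'))
```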


[NEED HELP] Populating the change list for 2.30.0 release

2021-05-27 Thread Heejong Lee
Hi Beam developers,

I'm gathering the information for the changes in the 2.30.0 release. If you
have any idea about important *new features* / *breaking changes* /
*deprecation* / *known issues* for the 2.30.0 release, please note them
down in CHANGES.md or just let me know.

Thanks!


Re: Need maintainer permission of PyPI apache-beam package

2021-05-19 Thread Heejong Lee
It's done. Thanks Pablo!

On Wed, May 19, 2021 at 11:21 AM Pablo Estrada  wrote:

> I've sent you an invite to be a project maintainer. Let me know if that
> works.
> Best
> -P.
>
> On Tue, May 18, 2021 at 6:28 PM Heejong Lee  wrote:
>
>> Hi,
>>
>> I'm currently working on Beam 2.30.0 release and need help adding myself
>> to the maintainer group of PyPI apache-beam package. My PyPI username is
>> 'ihji'. Does anybody have permission for adding a new member to the
>> apache-beam maintainer group?
>>
>> Thanks!
>>
>


Need maintainer permission of PyPI apache-beam package

2021-05-18 Thread Heejong Lee
Hi,

I'm currently working on Beam 2.30.0 release and need help adding myself to
the maintainer group of PyPI apache-beam package. My PyPI username is
'ihji'. Does anybody have permission for adding a new member to the
apache-beam maintainer group?

Thanks!


Re: [PROPOSAL] Preparing for Beam 2.30.0 release

2021-05-13 Thread Heejong Lee
UPDATE:

All precommit and postcommit tests pass now:
https://github.com/apache/beam/pull/14711

We only have one open issue for Fix Version 2.30.0:
https://issues.apache.org/jira/browse/BEAM-12242?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20%22In%20Implementation%22%2C%20%22Triage%20Needed%22)%20AND%20fixVersion%20%3D%202.30.0

I will start building the RC release after we cherry-pick the last blocker.

On Thu, Apr 29, 2021 at 12:48 AM Heejong Lee  wrote:

> We have 10 open issues for Fix Version 2.30.0:
> https://issues.apache.org/jira/browse/BEAM-12242?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20%22In%20Implementation%22%2C%20%22Triage%20Needed%22)%20AND%20fixVersion%20%3D%202.30.0
>
> On Thu, Apr 29, 2021 at 12:30 AM Heejong Lee  wrote:
>
>> FYI, I just cut the 2.30.0 release branch. From now on, late commits for
>> 2.30.0 need to be cherry-picked. If you have any late commits, please make
>> sure that their Jira issues have the correct Fix Version, 2.30.0.
>>
>> On Tue, Apr 27, 2021 at 7:52 AM Kenneth Knowles  wrote:
>>
>>> SGTM. Thanks!
>>>
>>> On Mon, Apr 26, 2021 at 2:33 PM Heejong Lee  wrote:
>>>
>>>>
>>>>
>>>> On Mon, Apr 26, 2021 at 10:24 AM Robert Bradshaw 
>>>> wrote:
>>>>
>>>>> Confirming that the cut date is 4/28/2021 (in two days), right?
>>>>>
>>>>
>>>> Yes, 2.30.0 branch is scheduled to be cut on April 28.
>>>>
>>>>
>>>>>
>>>>> On Wed, Apr 21, 2021 at 4:41 PM Tomo Suzuki 
>>>>> wrote:
>>>>> >
>>>>> > Thank you for the preparation!
>>>>> >
>>>>> > > a few responses that some high priority changes
>>>>> >
>>>>> > Would you be willing to share the items for visibility?
>>>>>
>>>>> There are several PRs in flight (or recently merged) to get
>>>>> portability working well with Dataflow for this release.
>>>>>
>>>>
>>>> We can still cherry-pick them by importance after the branch cut.
>>>>
>>>>
>>>>>
>>>>> >
>>>>> > On Wed, Apr 21, 2021 at 7:21 PM Kenneth Knowles 
>>>>> wrote:
>>>>> > >
>>>>> > > Also the 2.29.0 was re-cut.
>>>>> > >
>>>>> > > Usually a delay in one release should not delay the next release,
>>>>> because each release represents a certain quantity of changes. But in this
>>>>> case, the actual quantity of changes is affected by the re-cut, too.
>>>>> > >
>>>>> > > On Wed, Apr 21, 2021 at 4:12 PM Heejong Lee 
>>>>> wrote:
>>>>> > >>
>>>>> > >> Update on the 2.30.0 branch cut schedule:
>>>>> > >>
>>>>> > >> I'm thinking of delaying the branch cut a week since I've got a
>>>>> few responses that some high priority changes are still ongoing.
>>>>> > >>
>>>>> > >> The new cut date is April 28.
>>>>> > >>
>>>>> > >>
>>>>> > >> On Tue, Apr 20, 2021 at 6:07 PM Ahmet Altay 
>>>>> wrote:
>>>>> > >>>
>>>>> > >>> +1 and thank you!
>>>>> > >>>
>>>>> > >>> On Tue, Apr 20, 2021 at 4:55 PM Heejong Lee 
>>>>> wrote:
>>>>> > >>>>
>>>>> > >>>> Hi All,
>>>>> > >>>>
>>>>> > >>>> Beam 2.30.0 release is scheduled to be cut on April 21
>>>>> according to the release calendar [1]
>>>>> > >>>>
>>>>> > >>>> I'd like to volunteer myself to be the release manager for this
>>>>> release. I plan on cutting the release branch on the scheduled date.
>>>>> > >>>>
>>>>> > >>>> Any comments or objections ?
>>>>> > >>>>
>>>>> > >>>> Thanks,
>>>>> > >>>> Heejong
>>>>> > >>>>
>>>>> > >>>> [1]
>>>>> https://calendar.google.com/calendar/u/0/embed?src=0p73sl034k80oob7seouani...@group.calendar.google.com&ctz=America/Los_Angeles
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Regards,
>>>>> > Tomo
>>>>>
>>>>


Re: [PROPOSAL] Preparing for Beam 2.30.0 release

2021-04-29 Thread Heejong Lee
We have 10 open issues for Fix Version 2.30.0:
https://issues.apache.org/jira/browse/BEAM-12242?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20%22In%20Implementation%22%2C%20%22Triage%20Needed%22)%20AND%20fixVersion%20%3D%202.30.0

On Thu, Apr 29, 2021 at 12:30 AM Heejong Lee  wrote:

> FYI, I just cut the 2.30.0 release branch. From now on, late commits for
> 2.30.0 need to be cherry-picked. If you have any late commits, please make
> sure that their Jira issues have the correct Fix Version, 2.30.0.
>
> On Tue, Apr 27, 2021 at 7:52 AM Kenneth Knowles  wrote:
>
>> SGTM. Thanks!
>>
>> On Mon, Apr 26, 2021 at 2:33 PM Heejong Lee  wrote:
>>
>>>
>>>
>>> On Mon, Apr 26, 2021 at 10:24 AM Robert Bradshaw 
>>> wrote:
>>>
>>>> Confirming that the cut date is 4/28/2021 (in two days), right?
>>>>
>>>
>>> Yes, 2.30.0 branch is scheduled to be cut on April 28.
>>>
>>>
>>>>
>>>> On Wed, Apr 21, 2021 at 4:41 PM Tomo Suzuki  wrote:
>>>> >
>>>> > Thank you for the preparation!
>>>> >
>>>> > > a few responses that some high priority changes
>>>> >
>>>> > Would you be willing to share the items for visibility?
>>>>
>>>> There are several PRs in flight (or recently merged) to get
>>>> portability working well with Dataflow for this release.
>>>>
>>>
>>> We can still cherry-pick them by importance after the branch cut.
>>>
>>>
>>>>
>>>> >
>>>> > On Wed, Apr 21, 2021 at 7:21 PM Kenneth Knowles 
>>>> wrote:
>>>> > >
>>>> > > Also the 2.29.0 was re-cut.
>>>> > >
>>>> > > Usually a delay in one release should not delay the next release,
>>>> because each release represents a certain quantity of changes. But in this
>>>> case, the actual quantity of changes is affected by the re-cut, too.
>>>> > >
>>>> > > On Wed, Apr 21, 2021 at 4:12 PM Heejong Lee 
>>>> wrote:
>>>> > >>
>>>> > >> Update on the 2.30.0 branch cut schedule:
>>>> > >>
>>>> > >> I'm thinking of delaying the branch cut a week since I've got a
>>>> few responses that some high priority changes are still ongoing.
>>>> > >>
>>>> > >> The new cut date is April 28.
>>>> > >>
>>>> > >>
>>>> > >> On Tue, Apr 20, 2021 at 6:07 PM Ahmet Altay 
>>>> wrote:
>>>> > >>>
>>>> > >>> +1 and thank you!
>>>> > >>>
>>>> > >>> On Tue, Apr 20, 2021 at 4:55 PM Heejong Lee 
>>>> wrote:
>>>> > >>>>
>>>> > >>>> Hi All,
>>>> > >>>>
>>>> > >>>> Beam 2.30.0 release is scheduled to be cut on April 21 according
>>>> to the release calendar [1]
>>>> > >>>>
>>>> > >>>> I'd like to volunteer myself to be the release manager for this
>>>> release. I plan on cutting the release branch on the scheduled date.
>>>> > >>>>
>>>> > >>>> Any comments or objections ?
>>>> > >>>>
>>>> > >>>> Thanks,
>>>> > >>>> Heejong
>>>> > >>>>
>>>> > >>>> [1]
>>>> https://calendar.google.com/calendar/u/0/embed?src=0p73sl034k80oob7seouani...@group.calendar.google.com&ctz=America/Los_Angeles
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Regards,
>>>> > Tomo
>>>>
>>>


Re: [PROPOSAL] Preparing for Beam 2.30.0 release

2021-04-29 Thread Heejong Lee
FYI, I just cut the 2.30.0 release branch. From now on, late commits for
2.30.0 need to be cherry-picked. If you have any late commits, please make
sure that their Jira issues have the correct Fix Version, 2.30.0.

On Tue, Apr 27, 2021 at 7:52 AM Kenneth Knowles  wrote:

> SGTM. Thanks!
>
> On Mon, Apr 26, 2021 at 2:33 PM Heejong Lee  wrote:
>
>>
>>
>> On Mon, Apr 26, 2021 at 10:24 AM Robert Bradshaw 
>> wrote:
>>
>>> Confirming that the cut date is 4/28/2021 (in two days), right?
>>>
>>
>> Yes, 2.30.0 branch is scheduled to be cut on April 28.
>>
>>
>>>
>>> On Wed, Apr 21, 2021 at 4:41 PM Tomo Suzuki  wrote:
>>> >
>>> > Thank you for the preparation!
>>> >
>>> > > a few responses that some high priority changes
>>> >
>>> > Would you be willing to share the items for visibility?
>>>
>>> There are several PRs in flight (or recently merged) to get
>>> portability working well with Dataflow for this release.
>>>
>>
>> We can still cherry-pick them by importance after the branch cut.
>>
>>
>>>
>>> >
>>> > On Wed, Apr 21, 2021 at 7:21 PM Kenneth Knowles 
>>> wrote:
>>> > >
>>> > > Also the 2.29.0 was re-cut.
>>> > >
>>> > > Usually a delay in one release should not delay the next release,
>>> because each release represents a certain quantity of changes. But in this
>>> case, the actual quantity of changes is affected by the re-cut, too.
>>> > >
>>> > > On Wed, Apr 21, 2021 at 4:12 PM Heejong Lee 
>>> wrote:
>>> > >>
>>> > >> Update on the 2.30.0 branch cut schedule:
>>> > >>
>>> > >> I'm thinking of delaying the branch cut a week since I've got a few
>>> responses that some high priority changes are still ongoing.
>>> > >>
>>> > >> The new cut date is April 28.
>>> > >>
>>> > >>
>>> > >> On Tue, Apr 20, 2021 at 6:07 PM Ahmet Altay 
>>> wrote:
>>> > >>>
>>> > >>> +1 and thank you!
>>> > >>>
>>> > >>> On Tue, Apr 20, 2021 at 4:55 PM Heejong Lee 
>>> wrote:
>>> > >>>>
>>> > >>>> Hi All,
>>> > >>>>
>>> > >>>> Beam 2.30.0 release is scheduled to be cut on April 21 according
>>> to the release calendar [1]
>>> > >>>>
>>> > >>>> I'd like to volunteer myself to be the release manager for this
>>> release. I plan on cutting the release branch on the scheduled date.
>>> > >>>>
>>> > >>>> Any comments or objections ?
>>> > >>>>
>>> > >>>> Thanks,
>>> > >>>> Heejong
>>> > >>>>
>>> > >>>> [1]
>>> https://calendar.google.com/calendar/u/0/embed?src=0p73sl034k80oob7seouani...@group.calendar.google.com&ctz=America/Los_Angeles
>>> >
>>> >
>>> >
>>> > --
>>> > Regards,
>>> > Tomo
>>>
>>


Re: [PROPOSAL] Preparing for Beam 2.30.0 release

2021-04-26 Thread Heejong Lee
On Mon, Apr 26, 2021 at 10:24 AM Robert Bradshaw 
wrote:

> Confirming that the cut date is 4/28/2021 (in two days), right?
>

Yes, 2.30.0 branch is scheduled to be cut on April 28.


>
> On Wed, Apr 21, 2021 at 4:41 PM Tomo Suzuki  wrote:
> >
> > Thank you for the preparation!
> >
> > > a few responses that some high priority changes
> >
> > Would you be willing to share the items for visibility?
>
> There are several PRs in flight (or recently merged) to get
> portability working well with Dataflow for this release.
>

We can still cherry-pick them by importance after the branch cut.


>
> >
> > On Wed, Apr 21, 2021 at 7:21 PM Kenneth Knowles  wrote:
> > >
> > > Also the 2.29.0 was re-cut.
> > >
> > > Usually a delay in one release should not delay the next release,
> because each release represents a certain quantity of changes. But in this
> case, the actual quantity of changes is affected by the re-cut, too.
> > >
> > > On Wed, Apr 21, 2021 at 4:12 PM Heejong Lee 
> wrote:
> > >>
> > >> Update on the 2.30.0 branch cut schedule:
> > >>
> > >> I'm thinking of delaying the branch cut a week since I've got a few
> responses that some high priority changes are still ongoing.
> > >>
> > >> The new cut date is April 28.
> > >>
> > >>
> > >> On Tue, Apr 20, 2021 at 6:07 PM Ahmet Altay  wrote:
> > >>>
> > >>> +1 and thank you!
> > >>>
> > >>> On Tue, Apr 20, 2021 at 4:55 PM Heejong Lee 
> wrote:
> > >>>>
> > >>>> Hi All,
> > >>>>
> > >>>> Beam 2.30.0 release is scheduled to be cut on April 21 according to
> the release calendar [1]
> > >>>>
> > >>>> I'd like to volunteer myself to be the release manager for this
> release. I plan on cutting the release branch on the scheduled date.
> > >>>>
> > >>>> Any comments or objections ?
> > >>>>
> > >>>> Thanks,
> > >>>> Heejong
> > >>>>
> > >>>> [1]
> https://calendar.google.com/calendar/u/0/embed?src=0p73sl034k80oob7seouani...@group.calendar.google.com&ctz=America/Los_Angeles
> >
> >
> >
> > --
> > Regards,
> > Tomo
>


Re: [PROPOSAL] Preparing for Beam 2.30.0 release

2021-04-21 Thread Heejong Lee
Update on the 2.30.0 branch cut schedule:

I'm thinking of delaying the branch cut a week since I've got a few
responses that some high priority changes are still ongoing.

The new cut date is April 28.


On Tue, Apr 20, 2021 at 6:07 PM Ahmet Altay  wrote:

> +1 and thank you!
>
> On Tue, Apr 20, 2021 at 4:55 PM Heejong Lee  wrote:
>
>> Hi All,
>>
>> Beam 2.30.0 release is scheduled to be cut on April 21 according to the
>> release calendar [1]
>>
>> I'd like to volunteer myself to be the release manager for this release.
>> I plan on cutting the release branch on the scheduled date.
>>
>> Any comments or objections ?
>>
>> Thanks,
>> Heejong
>>
>> [1]
>> https://calendar.google.com/calendar/u/0/embed?src=0p73sl034k80oob7seouani...@group.calendar.google.com&ctz=America/Los_Angeles
>>
>


[PROPOSAL] Preparing for Beam 2.30.0 release

2021-04-20 Thread Heejong Lee
Hi All,

Beam 2.30.0 release is scheduled to be cut on April 21 according to the
release calendar [1]

I'd like to volunteer myself to be the release manager for this release. I
plan on cutting the release branch on the scheduled date.

Any comments or objections ?

Thanks,
Heejong

[1]
https://calendar.google.com/calendar/u/0/embed?src=0p73sl034k80oob7seouani...@group.calendar.google.com&ctz=America/Los_Angeles


Re: [ANNOUNCE] New committer: Reza Ardeshir Rokni

2020-09-10 Thread Heejong Lee
Congratulations!

On Thu, Sep 10, 2020 at 4:42 PM Robert Bradshaw  wrote:

> Thank you and welcome, Reza!
>
> On Thu, Sep 10, 2020 at 4:00 PM Ahmet Altay  wrote:
>
>> Congratulations Reza! And thank you for your contributions!
>>
>> On Thu, Sep 10, 2020 at 3:59 PM Chamikara Jayalath 
>> wrote:
>>
>>> Congrats Reza!
>>>
>>> On Thu, Sep 10, 2020 at 10:35 AM Kenneth Knowles 
>>> wrote:
>>>
 Hi all,

 Please join me and the rest of the Beam PMC in welcoming a new
 committer: Reza Ardeshir Rokni.

 Reza has been part of the Beam community since 2017! Reza has
 spearheaded advanced Beam examples [1], blogged and presented at multiple
 Beam Summits. Reza helps out users on the mailing lists [2] and
 StackOverflow [3]. When Reza's work uncovers a missing feature in Beam, he
 adds it [4]. Considering these contributions, the Beam PMC trusts Reza with
 the responsibilities of a Beam committer [5].

 Thank you, Reza, for your contributions.

 Kenn

 [1] https://github.com/apache/beam/pull/3961
 [2]
 https://lists.apache.org/list.html?u...@beam.apache.org:gte=0d:reza%20rokni
 [3] https://stackoverflow.com/tags/apache-beam/topusers
 [4] https://github.com/apache/beam/pull/11929
 [5]
 https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer

>>>


Re: Re-running GitHub Actions jobs

2020-09-03 Thread Heejong Lee
On Thu, Sep 3, 2020 at 11:05 AM Brian Hulette  wrote:

> The new GitHub Actions workflows that run Java and Python tests against
> different targets (macos, ubuntu, windows) are great! But just like our
> Jenkins infra they flake occasionally. Should we be re-running all of these
> jobs until we get green runs?
>
> Unfortunately it's not possible to re-run an individual job in a workflow
> [1], the only option is to re-run all jobs, so flaky tests become even more
> problematic.
>
> I see two options:
> 1) Consider it "good enough" if just Jenkins CI passes and any GitHub
> actions failures appear to be flakes.
> 2) Require that all Jenkins and GitHub checks pass.
>
> My vote is for (2). (1) risks merging legitimate breakages, and one could
> argue that making flaky tests extra painful is a good thing. Also we can
> always make an exception if an obvious flake is blocking a critical PR.
>

+1 for (2), given that it might not be so easy to figure out whether a
failure is flaky (or how critical it is).
BTW, I see it's impossible to re-run a specific test, but how do we re-run
all tests then? Is there a menu item for it, or do we need to force-update
the commits?


>
>
> Also FYI - at first I thought these workflows only had the stdout
> available, but the test report directory is also zipped and uploaded as an
> artifact. When a failure occurs you can download it to get the full output.
>
>
> Brian
>
> [1]
> https://github.community/t/ability-to-rerun-just-a-single-job-in-a-workflow/17234
>


Re: [VOTE] Release 2.23.0, release candidate #1

2020-07-15 Thread Heejong Lee
I think we need to cherry-pick
https://issues.apache.org/jira/browse/BEAM-10397, which fixes missing
environment errors for Dataflow xlang pipelines. Internally, we have a
flaky xlang kafkaio test because of missing environment errors, and any
xlang pipeline using GroupByKey could encounter this.

On Wed, Jul 15, 2020 at 5:08 PM Ahmet Altay  wrote:

>
>
> On Wed, Jul 15, 2020 at 4:55 PM Robert Bradshaw 
> wrote:
>
>> All the artifacts, signatures, and hashes look good.
>>
>> I would like to understand the severity of
>> https://issues.apache.org/jira/browse/BEAM-10397 before giving my
>> vote.
>>
>
> +Heejong Lee  to comment on this.
>
>
>>
>> On Wed, Jul 15, 2020 at 10:51 AM Pablo Estrada 
>> wrote:
>> >
>> > +1
>> > I was able to run the python 3.8 quickstart from wheels on DirectRunner.
>> > I verified hashes for Python files.
>> > -P.
>> >
>> > On Fri, Jul 10, 2020 at 4:34 PM Ahmet Altay  wrote:
>> >>
>> >> I validated the python 3 quickstarts. I had issues with running with
>> python 3.8 wheel files, but did not have issues with source distributions,
>> or other python wheel files. I have not tested python 2 quickstarts.
>>
>
> Did someone validate python 3.8 wheels on Dataflow? I was not able to run
> that.
>
>
>> >>
>> >> On Thu, Jul 9, 2020 at 10:53 PM Valentyn Tymofieiev <
>> valen...@google.com> wrote:
>> >>>
>> >>> Hi everyone,
>> >>>
>> >>> Please review and vote on the release candidate #1 for the version
>> 2.23.0, as follows:
>> >>> [ ] +1, Approve the release
>> >>> [ ] -1, Do not approve the release (please provide specific comments)
>> >>>
>> >>>
>> >>> The complete staging area is available for your review, which
>> includes:
>> >>> * JIRA release notes [1],
>> >>> * the official Apache source release to be deployed to
>> dist.apache.org [2], which is signed with the key with fingerprint
>> 1DF50603225D29A4 [3],
>> >>> * all artifacts to be deployed to the Maven Central Repository [4],
>> >>> * source code tag "v2.23.0-RC1" [5],
>> >>> * website pull request listing the release [6], publishing the API
>> reference manual [7], and the blog post [8].
>> >>> * Java artifacts were built with Maven 3.6.0 and Oracle JDK
>> 1.8.0_201-b09.
>> >>> * Python artifacts are deployed along with the source release to the
>> dist.apache.org [2].
>> >>> * Validation sheet with a tab for 2.23.0 release to help with
>> validation [9].
>> >>> * Docker images published to Docker Hub [10].
>> >>>
>> >>> The vote will be open for at least 72 hours. It is adopted by
>> majority approval, with at least 3 PMC affirmative votes.
>> >>>
>> >>> Thanks,
>> >>> Release Manager
>> >>>
>> >>> [1]
>> https://jira.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12347145
>> >>> [2] https://dist.apache.org/repos/dist/dev/beam/2.23.0/
>> >>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>> >>> [4]
>> https://repository.apache.org/content/repositories/orgapachebeam-1105/
>> >>> [5] https://github.com/apache/beam/tree/v2.23.0-RC1
>> >>> [6] https://github.com/apache/beam/pull/12212
>> >>> [7] https://github.com/apache/beam-site/pull/605
>> >>> [8] https://github.com/apache/beam/pull/12213
>> >>> [9]
>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=596347973
>> >>> [10] https://hub.docker.com/search?q=apache%2Fbeam&type=image
>>
>


Re: XLang sub-graph representation within the SDKs pipeline types

2020-07-02 Thread Heejong Lee
On Wed, Jul 1, 2020 at 7:18 PM Robert Burke  wrote:

> From the Go SDK side, it was built that way nearly from the start.
> Historically there was a direct SDK rep -> Dataflow rep conversion, but
> that's been replaced with a SDK rep -> Beam Proto -> Dataflow rep
> conversion.
>
> In particular, this approach had a few benefits: it is easier to access
> local context for pipeline validation at construction time, to permit as
> early a failure as possible, which might be easier with native language
> constructs vs. beam representations of them (e.g. DoFns not matching ParDo
> & Collection types, and similar).
> Protos are convenient, but impose certain structure on how the pipeline
> graph is handled. (This isn't to say an earlier conversion isn't possible,
> one can do almost anything in code, but it lets the structure be optimised
> for this case.)
>
> The big advantage of translating from Beam proto -> Dataflow rep is
> that the Dataflow Rep can get the various unique IDs that are mandated for
> the Beam proto process.
>
> However, the same can't really be said for the other way around.  A good
> question is "when should the unique IDs be assigned?"
>

This is very true, and I would like to elaborate on the source of
friction when using external transforms. As Robert mentioned, the pipeline
proto refers to each component by a unique ID, and that unique ID is only
assigned when we convert the SDK pipeline object to the pipeline proto.
Before XLang, the pipeline-object-to-pipeline-proto conversion happened only
once, during the job submission phase. However, after XLang transforms were
introduced, it also happens when we request expansion of external
transforms from the expansion service. A unique ID generated for the
expansion request can be embedded in the returned external proto and later
conflict with other unique IDs generated for the job submission.


>
> While I'm not working on adding XLang to the Go SDK directly (that would
> be our wonderful intern, Kevin),  I've kind of pictured that the process
> was to provide the Expansion service with unique placeholders if unable to
> provide the right IDs, and substitute them in returned pipeline graph
> segment afterwards, once that is known. That is, we can be relatively
> certain that the expansion service will be self consistent, but it's the
> SDK requesting the expansion's responsibility to ensure they aren't
> colliding with the primary SDKs pipeline ids.
>

AFAIK, we're already doing this in the Java and Python SDKs: not providing
a "placeholder", but remembering which pipeline object maps to which unique
ID used in the expanded component proto.


>
> Otherwise, we could probably recommend a translation protocol (if one
> doesn't exist already, it probably does) and when XLang expansions are to
> happen in the SDK -> beam proto process. So something like Pass 1, intern
> all coders and Pcollections, Pass 2 intern all DoFns and environments, Pass
> 3 expand Xlang, ... Etc.
>

Not sure I understand correctly, but a downstream transform that consumes
the output of an external transform needs some information, such as the
output PCollection, from the expanded external transform during the
pipeline construction phase.


> The other half of this is what happens when going from Beam proto
> -> SDK. This happens during pipeline execution, but at least in the Go SDK
> partly happens when creating the Dataflow rep. In particular, Coder
> reference values only have a populated ID when they've been "rehydrated"
> from the Beam proto, since the Beam Proto is the first place where such IDs
> are correctly assigned.
>
> Tl;dr: I think the right question to sort out is when should IDs be
> expected to be assigned and available during pipeline construction.
>
> On Wed, Jul 1, 2020, 6:34 PM Luke Cwik  wrote:
>
>> It seems like we keep running into translation issues with XLang due to
>> how it is represented in the SDK. (e.g. Brian's work on context map due to
>> loss of coder ids, Heejong's work related to missing environment ids on
>> windowing strategies).
>>
>> I understand that there is an effort that is Dataflow specific where the
>> conversion of the Beam proto -> Dataflow API (v1b3) will help with some
>> issues but it still requires the SDK pipeline representation -> Beam proto
>> to occur correctly which won't be fixed by the Dataflow specific effort.
>>
>> Why did we go with the current approach?
>>
>> What other ways could we do this?
>>
>


Re: Beam Jenkins Migration

2020-06-18 Thread Heejong Lee
This is awesome. Could non-committers also trigger the test now?

On Wed, Jun 17, 2020 at 6:12 AM Damian Gadomski 
wrote:

> Hello,
>
> Good news, we've just migrated to the new CI: https://ci-beam.apache.org.
> As from now beam projects at builds.apache.org are disabled.
>
> If you experience any issues with the new setup please let me know, either
> here or on ASF slack.
>
> Regards,
> Damian
>
> On Mon, Jun 15, 2020 at 10:40 PM Damian Gadomski <
> damian.gadom...@polidea.com> wrote:
>
>> Happy to see your positive response :)
>>
>> @Udi Meiri, Thanks for pointing that out. I've checked it and indeed it
>> needs some attention.
>>
>> There are two things, based on my research:
>>
>>- data uploaded by the performance and load test jobs directly to the
>>InfluxDB - that should be handled automatically, as new jobs will
>>upload the same data in the same way
>>- data fetched using Jenkins API by the metrics tool (syncjenkins.py)
>>- here the situation is a bit more complex, as the script relies on the
>>build number (it's actually used as a time reference, and the primary key in
>>the DB is created from it). To avoid refactoring the script and migrating
>>the database to use a timestamp instead of the build number, I've just
>>migration to use timestamp instead of build number I've just
>>"fast-forwarded" the numbers on the new https://ci-beam.apache.org to
>>follow current numbering from the old CI. Therefore simple replacement of
>>the Jenkins URL in the metrics scripts should do the trick to have
>>continuous metrics data. I'll check that tomorrow on my local grafana
>>instance.
>>
>> Please let me know if there's anything that I missed.
>>
>> Regards,
>> Damian
>>
>> On Mon, Jun 15, 2020 at 8:05 PM Alexey Romanenko <
>> aromanenko@gmail.com> wrote:
>>
>>> Great! Thank you for working on this and letting us know.
>>>
>>> On 12 Jun 2020, at 16:58, Damian Gadomski 
>>> wrote:
>>>
>>> Hello,
>>>
>>> During the last few days, I was preparing for the Beam Jenkins migration
>>> from builds.apache.org to ci-beam.apache.org. The new Jenkins Master
>>> will be dedicated only for Beam related jobs, all Beam Committers will have
>>> build configure access, and Beam PMC will have Admin (GUI) Access.
>>>
>>> We (in cooperation with Infra) are almost ready for the migration itself
>>> and I want to share with you the details of our plan. We are planning to
>>> start the migration next week, most likely on Tuesday. I'll keep you
>>> updated on the progress. We do not expect any issues nor the outage of the
>>> CI services, everything should be more or less unnoticeable. Just don't be
>>> surprised that the Jenkins URL will change to https://ci-beam.apache.org
>>>
>>> If you are curious, here are the steps that we are going to take:
>>>
>>> 1. Create 16 new CI nodes that will be connected to the new CI. We will
>>> then have simultaneously running two CI servers.
>>> 2. Verify that new builds work as expected on the new instance (compare
>>> results of cron builds). (a day or two would be sufficient)
>>> 3. Move the responsibility of Phrase/PR/Commit builds to the new CI,
>>> disable on the old one.
>>> 4. Modify the .test-infra/jenkins/README.md to point to the new instance
>>> and replace Post-commit tests status in README.md and
>>> .github/PULL_REQUEST_TEMPLATE.md
>>> 5. Disable the jobs on the old Jenkins and add a description to each job
>>> with the URL to the corresponding one on the new CI.
>>> 6. Turn off VM instances of the old nodes.
>>> 7. Remove VM instances of the old nodes.
>>>
>>> In case of any questions or doubts feel free to ask :)
>>>
>>> Regards,
>>> Damian
>>>
>>>
>>>


Re: Python SDK ReadFromKafka: Timeout expired while fetching topic metadata

2020-06-08 Thread Heejong Lee
DirectRunner is not well-tested for xlang transforms, and you need to
specify the jar_packages experimental flag for Java dependencies from the
Python SDK. I'd recommend using 2.22 + FlinkRunner for xlang pipelines.

On Mon, Jun 8, 2020 at 3:27 PM Chamikara Jayalath 
wrote:

> To clarify, Kafka dependency was already available as an embedded
> dependency in Java SDK Harness but not sure if this worked for
> DirectRunner. starting 2.22 we'll be staging dependencies from the
> environment during pipeline submission.
>
> On Mon, Jun 8, 2020 at 3:23 PM Chamikara Jayalath 
> wrote:
>
>> Seems like Java dependency is not being properly set up when running the
>> cross-language Kafka step. I don't think this was available for Beam 2.21.
>> Can you try with the latest Beam HEAD or Beam 2.22 when it's released ?
>> +Heejong Lee 
>>
>> On Mon, Jun 8, 2020 at 12:39 PM Piotr Filipiuk 
>> wrote:
>>
>>> Pasting the error inline:
>>>
>>> ERROR:root:severity: ERROR
>>> timestamp {
>>>   seconds: 1591405163
>>>   nanos: 81500
>>> }
>>> message: "Client failed to dequeue and process the value"
>>> trace: "org.apache.beam.sdk.util.UserCodeException:
>>> java.lang.NoClassDefFoundError:
>>> org/springframework/expression/EvaluationContext\n\tat
>>> org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:36)\n\tat
>>> org.apache.beam.sdk.io.Read$UnboundedSourceAsSDFWrapperFn$DoFnInvoker.invokeGetSize(Unknown
>>> Source)\n\tat
>>> org.apache.beam.fn.harness.FnApiDoFnRunner$2.outputWithTimestamp(FnApiDoFnRunner.java:497)\n\tat
>>> org.apache.beam.fn.harness.FnApiDoFnRunner$ProcessBundleContext.output(FnApiDoFnRunner.java:1335)\n\tat
>>> org.apache.beam.sdk.transforms.DoFnOutputReceivers$WindowedContextOutputReceiver.output(DoFnOutputReceivers.java:75)\n\tat
>>> org.apache.beam.sdk.io.Read$UnboundedSourceAsSDFWrapperFn.splitRestriction(Read.java:504)\n\tat
>>> org.apache.beam.sdk.io.Read$UnboundedSourceAsSDFWrapperFn$DoFnInvoker.invokeSplitRestriction(Unknown
>>> Source)\n\tat
>>> org.apache.beam.fn.harness.FnApiDoFnRunner.processElementForSplitRestriction(FnApiDoFnRunner.java:715)\n\tat
>>> org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:216)\n\tat
>>> org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:179)\n\tat
>>> org.apache.beam.fn.harness.FnApiDoFnRunner.outputTo(FnApiDoFnRunner.java:874)\n\tat
>>> org.apache.beam.fn.harness.FnApiDoFnRunner.processElementForPairWithRestriction(FnApiDoFnRunner.java:688)\n\tat
>>> org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:216)\n\tat
>>> org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:179)\n\tat
>>> org.apache.beam.fn.harness.FnApiDoFnRunner.outputTo(FnApiDoFnRunner.java:874)\n\tat
>>> org.apache.beam.fn.harness.FnApiDoFnRunner.access$600(FnApiDoFnRunner.java:121)\n\tat
>>> org.apache.beam.fn.harness.FnApiDoFnRunner$ProcessBundleContext.outputWithTimestamp(FnApiDoFnRunner.java:1340)\n\tat
>>> org.apache.beam.fn.harness.FnApiDoFnRunner$ProcessBundleContext.output(FnApiDoFnRunner.java:1335)\n\tat
>>> org.apache.beam.sdk.transforms.DoFnOutputReceivers$WindowedContextOutputReceiver.output(DoFnOutputReceivers.java:75)\n\tat
>>> org.apache.beam.sdk.transforms.MapElements$1.processElement(MapElements.java:139)\n\tat
>>> org.apache.beam.sdk.transforms.MapElements$1$DoFnInvoker.invokeProcessElement(Unknown
>>> Source)\n\tat
>>> org.apache.beam.fn.harness.FnApiDoFnRunner.processElementForParDo(FnApiDoFnRunner.java:672)\n\tat
>>> org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:216)\n\tat
>>> org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:179)\n\tat
>>> org.apache.beam.fn.harness.BeamFnDataReadRunner.forwardElementToConsumer(BeamFnDataReadRunner.java:177)\n\tat
>>> org.apache.beam.fn.harness.data.QueueingBeamFnDataClient.drainAndBlock(QueueingBeamFnDataClient.java:106)\n\tat
>>> org.apache.beam.fn.harness.control.ProcessBundleHandler.processBundle(ProcessBundleHandler.java:294)\n\tat
>>> org.apache.beam.fn.harness.control.BeamFnControlClient.delegateOnInstructionRequestType(BeamFnControlClient.java:173)\n\tat
>>> org.apache.beam.fn.harness.control.
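
To make the recommendation at the top of this thread concrete, below is a
minimal sketch of running the cross-language Kafka read on the FlinkRunner
with Beam 2.22+. The broker address, topic name, and the commented-out Flink
master setting are placeholders, not values taken from this thread.

```
# ReadFromKafka on the FlinkRunner (a sketch under the assumptions above);
# without --flink_master, Beam starts an embedded local Flink cluster.
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=FlinkRunner',
    '--streaming',
    # '--flink_master=localhost:8081',  # uncomment to target a real cluster
])
with beam.Pipeline(options=options) as p:
    (p
     | ReadFromKafka(
         consumer_config={'bootstrap.servers': 'localhost:9092'},
         topics=['my-topic'])
     | beam.Map(print))
```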

Re: Jira components for cross-language transforms

2020-05-28 Thread Heejong Lee
If we use one meta component tag for all xlang-related issues, I would
prefer just "xlang". Then we could attach the "xlang" tag not only to
language-specific SDK tags but also to other runner and IO tags, e.g.
['xlang', 'io-java-kafka'], ['xlang', 'runner-dataflow'].

On Thu, May 28, 2020 at 7:49 PM Robert Burke  wrote:

> +1 to new component not split. The language concerns can be represented
> and filtered with the existing sdk tags. I know I'm interested in all
> sdk-go issues, and would prefer not to have to union tags when searching
> for Go related issues.
>
> On Thu, 28 May 2020 at 15:48, Ismaël Mejía  wrote:
>
>> +1 to new component not splitted
>>
>> Other use case is using libraries not available in your language e.g.
>> using some python transform that relies in a python only API in the middle
>> of a Java pipeline.
>>
>>
>> On Thu, May 28, 2020 at 11:12 PM Chamikara Jayalath 
>> wrote:
>>
>>> I proposed three components since the audience might be different. Also
>>> we can use the same component to track issues related to all cross-language
>>> wrappers available in a given SDK. If this is too much a single component
>>> is fine as well.
>>>
>>> Ashwin, as others pointed out, the cross-language transforms framework
>>> is primarily for giving SDKs access to transforms that are not
>>> available natively. But there are other potential use-cases as well (for
>>> example, using two different Python environments within the same
>>> pipeline).
>>> Exact performance will depend on the runner implementation as well as
>>> the additional cost involved due to serializing/deserializing data across
>>> environment boundaries. But we haven't done enough analysis/benchmarking to
>>> provide more details on this.
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Thu, May 28, 2020 at 1:55 PM Kyle Weaver  wrote:
>>>
 > What are some of the benefits / drawbacks of using cross-language
 transforms? Would a native Python transform perform better than a
 cross-language transform written in Java that is then used in a Python
 pipeline?

 As Rui says, the main advantage is code reuse. See
 https://beam.apache.org/roadmap/connectors-multi-sdk/ for more
 information.

 On Thu, May 28, 2020 at 4:53 PM Rui Wang  wrote:

> +1 on dedicated components for cross-language transform. It might be
> easy to manage to have one component (one tag for all SDK) rather than
> multiple ones.
>
>
> Re Ashwin,
>
> Cham knows more than me. AFAIK, cross-language transforms will
> maximize code reuse for newly developed SDK (e.g. IO transforms for Go
> SDK). Of course, a SDK can develop its own IOs, but it's lots of work.
>
>
> -Rui
>
> On Thu, May 28, 2020 at 1:50 PM Ashwin Ramaswami <
> aramaswa...@gmail.com> wrote:
>
>> What are some of the benefits / drawbacks of using cross-language
>> transforms? Would a native Python transform perform better than a
>> cross-language transform written in Java that is then used in a Python
>> pipeline?
>>
>> Ashwin Ramaswami
>> Student
>> *Find me on my:* LinkedIn  |
>> Website  | GitHub
>> 
>>
>>
>> On Thu, May 28, 2020 at 4:44 PM Kyle Weaver 
>> wrote:
>>
>>> SGTM. Though I'm not sure it's necessary to split by language. It
>>> might be easier to use a single cross-language tag, rather than having 
>>> to
>>> tag lots of issues as both sdks-python-xlang and sdks-java-xlang.
>>>
>>> On Thu, May 28, 2020 at 4:29 PM Chamikara Jayalath <
>>> chamik...@google.com> wrote:
>>>
 Hi All,

 I think it's good if we can have new Jira components to easily
 track various issues related to cross-language transforms.

 What do you think about adding the following Jira components ?

 sdks-python-xlang
 sdks-java-xlang
 sdks-go-xlang

 Jira component sdks-foo-xlang is for tracking issues related to
 cross-language transforms for SDK Foo. For example,
 * Issues related cross-language transforms wrappers written in SDK
 Foo
 * Issues related to transforms implemented in SDK Foo that are
 offered as cross-language transforms to other SDKs
 * Issues related to cross-language transform expansion service
 implemented for SDK Foo

 Thanks,
 Cham

>>>


Re: No space left on device - beam-jenkins 1 and 7

2020-03-11 Thread Heejong Lee
Still seeing "no space left on device" errors on jenkins-7 (for example:
https://builds.apache.org/job/beam_PreCommit_PythonLint_Commit/2754/)


On Fri, Mar 6, 2020 at 7:11 PM Alan Myrvold  wrote:

> Did a one time cleanup of tmp files owned by jenkins older than 3 days.
> Agree that we need a longer term solution.
>
> Passing recent tests on all executors except jenkins-12, which has not
> scheduled recent builds for the past 13 days. Not scheduling:
> https://builds.apache.org/computer/apache-beam-jenkins-12/builds
> 
> Recent passing builds:
> https://builds.apache.org/computer/apache-beam-jenkins-1/builds
> 
> https://builds.apache.org/computer/apache-beam-jenkins-2/builds
> 
> https://builds.apache.org/computer/apache-beam-jenkins-3/builds
> 
> https://builds.apache.org/computer/apache-beam-jenkins-4/builds
> 
> https://builds.apache.org/computer/apache-beam-jenkins-5/builds
> 
> https://builds.apache.org/computer/apache-beam-jenkins-6/builds
> 
> https://builds.apache.org/computer/apache-beam-jenkins-7/builds
> 
> https://builds.apache.org/computer/apache-beam-jenkins-8/builds
> 
> https://builds.apache.org/computer/apache-beam-jenkins-9/builds
> 
> https://builds.apache.org/computer/apache-beam-jenkins-10/builds
> 
> https://builds.apache.org/computer/apache-beam-jenkins-11/builds
> 
> https://builds.apache.org/computer/apache-beam-jenkins-13/builds
> 
> https://builds.apache.org/computer/apache-beam-jenkins-14/builds
> 
> https://builds.apache.org/computer/apache-beam-jenkins-15/builds
> 
> https://builds.apache.org/computer/apache-beam-jenkins-16/builds
> 
>
> On Fri, Mar 6, 2020 at 11:54 AM Ahmet Altay  wrote:
>
>> +Alan Myrvold  is doing a one time cleanup. I agree
>> that we need to have a solution to automate this task or address the root
>> cause of the buildup.
>>
>> On Thu, Mar 5, 2020 at 2:47 AM Michał Walenia 
>> wrote:
>>
>>> Hi there,
>>> it seems we have a problem with Jenkins workers again. Nodes 1 and 7
>>> both fail jobs with "No space left on device".
>>> Who is the best person to contact in these cases (someone with access
>>> permissions to the workers)?
>>>
>>> I also noticed that such errors are becoming more and more frequent
>>> recently, and I'd like to discuss how this can be remedied. Can a cleanup
>>> task be automated on Jenkins somehow?
>>>
>>> Regards
>>> Michal
>>>
>>> --
>>>
>>> Michał Walenia
>>> Polidea  | Software Engineer
>>>
>>> M: +48 791 432 002 <+48791432002>
>>> E: michal.wale...@polidea.com
>>>
>>> Unique Tech
>>> Check out our projects! 
>>>
>>


Re: Error logging from fn_api_runners

2020-03-02 Thread Heejong Lee
I think it should be either info or debug but not error.

On Mon, Mar 2, 2020 at 2:35 PM Ning Kang  wrote:

> Hi,
>
> I just observed some error level loggings like these:
> ```
> ERROR:apache_beam.runners.portability.fn_api_runner:created 1 workers
> {'worker_5':
>  at 0x127fdaa58>}
> ERROR:apache_beam.runners.portability.fn_api_runner:created 1 workers
> {'worker_5':
>  at 0x127fdaa58>}
> ERROR:apache_beam.runners.portability.fn_api_runner:created 1 workers
> {'worker_5':
>  at 0x127fdaa58>}
> ERROR:apache_beam.runners.portability.fn_api_runner:created 1 workers
> {'worker_5':
>  at 0x127fdaa58>}
> ```
> It's coming from this PR
> 
> .
> ```
>
> def get_worker_handlers(
> self,
> environment_id,  # type: Optional[str]
> num_workers  # type: int
> ):
>   # type: (...) -> List[WorkerHandler]
>   if environment_id is None:
> # Any environment will do, pick one arbitrarily.
> environment_id = next(iter(self._environments.keys()))
>   environment = self._environments[environment_id]
>
>   # assume all environments except EMBEDDED_PYTHON use gRPC.
>   if environment.urn == python_urns.EMBEDDED_PYTHON:
> # special case for EmbeddedWorkerHandler: there's no need for a gRPC
> # server, but to pass the type check on WorkerHandler.create() we
> # make like we have a GrpcServer instance.
> self._grpc_server = cast(GrpcServer, None)
>   elif self._grpc_server is None:
> self._grpc_server = GrpcServer(
> self._state, self._job_provision_info, self)
>
>   worker_handler_list = self._cached_handlers[environment_id]
>   if len(worker_handler_list) < num_workers:
> for _ in range(len(worker_handler_list), num_workers):
>   worker_handler = WorkerHandler.create(
>   environment,
>   self._state,
>   self._job_provision_info,
>   self._grpc_server)
>   _LOGGER.info(
>   "Created Worker handler %s for environment %s",
>   worker_handler,
>   environment)
>   self._cached_handlers[environment_id].append(worker_handler)
>   self._workers_by_id[worker_handler.worker_id] = worker_handler
>   worker_handler.start_worker()
>   _LOGGER.error("created %s workers %s", num_workers, self._workers_by_id)
>   return self._cached_handlers[environment_id][:num_workers]
>
> ```
> Is this supposed to be an info level logging?
>
> Thanks!
>
> Ning.
>


Re: Cross-language pipelines status

2020-02-11 Thread Heejong Lee
On Tue, Feb 11, 2020 at 9:37 AM Alexey Romanenko 
wrote:

> Hi all,
>
> I just wanted to ask for more details about the status of cross-language
> pipelines (rather, transforms). I see some discussions about that here, but
> I think it’s more around cross-language IOs.
>
> I'd appreciate any information about this topic and answers to these
> questions:
> - Are there any examples/guides of setting up and running cross-languages
> pipelines?
>

AFAIK, there's no official guide for cross-language pipelines. But there
are examples and test cases you can use as references, such as:
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount_xlang.py
https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIOExternalTest.java
https://github.com/apache/beam/blob/master/runners/core-construction-java/src/test/java/org/apache/beam/runners/core/construction/ValidateRunnerXlangTest.java
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/expansion_service_test.py



> - Is this something that already can be used (currently interested in
> Java/Python pipelines) or the main work is still in progress? More
> precisely - I’m more focused on executing some Python code from Java-based
> pipelines.
>

The runner and SDK support is in a working state, I'd say, but not many
IOs expose their cross-language interface yet (you can easily write a
cross-language configuration for any Python transform yourself, though).
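
To illustrate the last point, here is a minimal sketch of invoking an
external transform from the Python SDK. The URN, the payload contents, and
the expansion service address are assumptions made up for the example, and
ImplicitSchemaPayloadBuilder is assumed to be available in
apache_beam.transforms.external in your Beam version:

```
import apache_beam as beam
from apache_beam.transforms.external import (
    ExternalTransform, ImplicitSchemaPayloadBuilder)

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create(['a', 'b'])
        # Hypothetical URN registered with an expansion service assumed
        # to be running locally on port 8097; the payload carries the
        # transform's configuration.
        | ExternalTransform(
            'my:external:test:prefix',
            ImplicitSchemaPayloadBuilder({'data': u'x'}),
            'localhost:8097'))
```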


> - Is the information here
> https://beam.apache.org/roadmap/connectors-multi-sdk/ up-to-date? Are
> there any other entry points you can recommend?
>

I think it's up-to-date.


>
> Thanks!


Re: Enabling a new Jenkins job

2020-02-05 Thread Heejong Lee
Fixed. Seed job was overridden by another scheduled seed job. Thanks, Udi!


On Wed, Feb 5, 2020 at 2:04 PM Heejong Lee  wrote:

> I created a new Jenkins job in my PR[1] and the new job shows "This
> project is currently disabled"[2]. Does anybody know how to enable the
> new job?
>
> [1]: https://github.com/apache/beam/pull/10758
> [2]: https://builds.apache.org/job/beam_PostCommit_XVR_Spark/
>


Enabling a new Jenkins job

2020-02-05 Thread Heejong Lee
I created a new Jenkins job in my PR[1] and the new job shows "This project
is currently disabled"[2]. Does anybody know how to enable the new job?

[1]: https://github.com/apache/beam/pull/10758
[2]: https://builds.apache.org/job/beam_PostCommit_XVR_Spark/


Re: [ANNOUNCE] New committer: Hannah Jiang

2020-01-28 Thread Heejong Lee
Congratulations! :)

On Tue, Jan 28, 2020 at 4:43 PM Yichi Zhang  wrote:

> Congrats Hannah!
>
> On Tue, Jan 28, 2020 at 3:57 PM Yifan Zou  wrote:
>
>> Congratulations Hannah!!
>>
>> On Tue, Jan 28, 2020 at 3:55 PM Boyuan Zhang  wrote:
>>
>>> Thanks for all your contributions! Congratulations~
>>>
>>> On Tue, Jan 28, 2020 at 3:44 PM Pablo Estrada 
>>> wrote:
>>>
 yoooho : D

 On Tue, Jan 28, 2020 at 3:21 PM Luke Cwik  wrote:

> Hi everyone,
>
> Please join me and the rest of the Beam PMC in welcoming a new
> committer: Hannah Jiang
>
> Hannah has contributed to Beam in many ways, including work on
> building and releasing the Apache Beam SDK containers.
>
> In consideration of their contributions, the Beam PMC trusts them with
> the responsibilities of a Beam committer[1].
>
> Thanks for your contributions Hannah!
>
> Luke, on behalf of the Apache Beam PMC.
>
> [1]
> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>



Re: [DISCUSS][PROPOSAL] Improvements to the Apache Beam website

2020-01-27 Thread Heejong Lee
On Mon, Jan 27, 2020 at 11:19 AM Aizhamal Nurmamat kyzy 
wrote:

> Hi Alexey,
>
> Answers are inline:
>
> Do we have any user demands for documentation translation into other
>> languages? I’m asking this because, in my experience, it’s quite tough work
>> to translate everything and it won’t be always up-to-date with the
>> mainstream docs in English.
>>
>
> We know of at least one user who has been trying to grow a Beam community
> in China and translate the documentation with the local community help:
> --> https://github.com/mybeam/Apache-Beam-/tree/master/website
> -->
> https://lists.apache.org/thread.html/6b7008affee7d70aa0ef13bce7d57455c85759b0af7e08582a086f53%40%3Cdev.beam.apache.org%3E
>
> This would hopefully unblock other contributors. For translations, the
> idea is that the source of truth is the english version, and we'll make
> sure it's visible on the header of translated pages, as well as dates for
> the latest updates.
>

+1

I think localized content helps non-English speaking users a lot to spark
interest, even if the content is somewhat out of date.


>
> Also, moving to another doc engine probably will require us to change a
>> format of mark-up language or not?. What are the other advantages of Docsy
>> over Jekyll?
>>
>
> We will have to make small tweaks to the Jekyll MD files, but as Brian
> pointed out in the old thread we can use some tools to automate the process:
> -->   https://gohugo.io/commands/hugo_import_jekyll/
>
> I’d also suggest to improve Beam site context search to be able to
>> differentiate search queries over user documentation and/or API references.
>>
> +1. Will add this as a work item.
>
>


Re: External transform API in Java SDK

2020-01-02 Thread Heejong Lee
If we pass in TypeDescriptor objects instead of relying on compile-time Java
type information, we could match the returned coders against the given type
descriptors at pipeline construction time. It would help prevent
pipelines from failing with a class cast exception in runners. I've created
the JIRA ticket: https://issues.apache.org/jira/browse/BEAM-9048

On Mon, Dec 30, 2019 at 10:27 AM Luke Cwik  wrote:

>
>
> On Mon, Dec 23, 2019 at 12:20 PM Heejong Lee  wrote:
>
>>
>>
>> On Fri, Dec 20, 2019 at 11:38 AM Luke Cwik  wrote:
>>
>>> What do side inputs look like?
>>>
>>
>> A user needs to first pass PCollections for side inputs into the external
>> transform in addition to ordinary input PCollections and define
>> PCollectionViews inside the external transform something like:
>>
>> PCollectionTuple pTuple =
>> PCollectionTuple.of("main1", main1)
>> .and("main2", main2)
>> .and("side", side)
>> .apply(External.of(...).withMultiOutputs());
>>
>> public static class TestTransform extends PTransform<PCollectionTuple, PCollectionTuple> {
>>   @Override
>>   public PCollectionTuple expand(PCollectionTuple input) {
>> PCollectionView<String> sideView =
>> input.get("side").apply(View.asSingleton());
>> PCollection<String> main =
>> PCollectionList.of(input.get("main1"))
>> .and(input.get("main2"))
>> .apply(Flatten.pCollections())
>> .apply(
>> ParDo.of(
>> new DoFn<String, String>() {
>>   @ProcessElement
>>   public void processElement(
>>   @Element String x,
>>   OutputReceiver<String> out,
>>       DoFn.ProcessContext c) {
>> out.output(x + c.sideInput(sideView));
>>   }
>> })
>> .withSideInputs(sideView));
>>
>>
>>
>>> On Thu, Dec 19, 2019 at 4:39 PM Heejong Lee  wrote:
>>>
>>>> I wanted to know if anybody has any comment on external transform API
>>>> for Java SDK.
>>>>
>>>> `External.of()` can create external transform for Java SDK. Depending
>>>> on input and output types, two additional methods are provided:
>>>> `withMultiOutputs()` which specifies the type of PCollection and
>>>> `withOutputType()` which specifies the type of output element. Some
>>>> examples are:
>>>>
>>>> PCollection<String> col =
>>>> testPipeline
>>>> .apply(Create.of("1", "2", "3"))
>>>> .apply(External.of(*...*));
>>>>
>>>> This is okay without additional methods since 1) input and output types
>>>> of external transform can be inferred 2) output PCollection is singular.
>>>>
>>>
>>> How does the type/coder at runtime get inferred (doesn't java's type
>>> erasure get rid of this information)?
>>>
>>
>>>
>>>> PCollectionTuple pTuple =
>>>> testPipeline
>>>> .apply(Create.of(1, 2, 3, 4, 5, 6))
>>>> .apply(
>>>> External.of(*...*).withMultiOutputs());
>>>>
>>>> This requires `withMultiOutputs()` since output PCollection is
>>>> PCollectionTuple.
>>>>
>>>
>>> Shouldn't this require a mapping from "output" name to coder/type
>>> variable to be specified as an argument to withMultiOutputs?
>>>
>>>
>>>> PCollection<KV<String, Long>> pCol =
>>>> testPipeline
>>>> .apply(Create.of("1", "2", "2", "3", "3", "3"))
>>>> .apply(
>>>> External.of(...)
>>>> .<KV<String, Long>>withOutputType())
>>>> .apply(
>>>> "toString",
>>>> MapElements.into(TypeDescriptors.strings()).via(
>>>> x -> String.format("%s->%s", x.getKey(), x.getValue())));
>>>>
>>>>  This requires `withOutputType()` since the output element type cannot
>>>> be inferred from method chaining. I think some users may feel awkward to
>>>> call a method with only the type parameter and empty parentheses. Without
>>>> `withOutputType()`, the type of the output element will be java.lang.Object
>>>> which might still be forcefully cast to KV.

Re: External transform API in Java SDK

2019-12-23 Thread Heejong Lee
On Fri, Dec 20, 2019 at 11:38 AM Luke Cwik  wrote:

> What do side inputs look like?
>

A user needs to first pass PCollections for side inputs into the external
transform in addition to ordinary input PCollections and define
PCollectionViews inside the external transform something like:

PCollectionTuple pTuple =
PCollectionTuple.of("main1", main1)
.and("main2", main2)
.and("side", side)
.apply(External.of(...).withMultiOutputs());

public static class TestTransform extends PTransform<PCollectionTuple, PCollectionTuple> {
  @Override
  public PCollectionTuple expand(PCollectionTuple input) {
PCollectionView<String> sideView =
input.get("side").apply(View.asSingleton());
PCollection<String> main =
PCollectionList.of(input.get("main1"))
.and(input.get("main2"))
.apply(Flatten.pCollections())
.apply(
ParDo.of(
new DoFn<String, String>() {
  @ProcessElement
  public void processElement(
  @Element String x,
  OutputReceiver<String> out,
  DoFn.ProcessContext c) {
out.output(x + c.sideInput(sideView));
  }
})
    .withSideInputs(sideView));



> On Thu, Dec 19, 2019 at 4:39 PM Heejong Lee  wrote:
>
>> I wanted to know if anybody has any comment on external transform API for
>> Java SDK.
>>
>> `External.of()` can create external transform for Java SDK. Depending on
>> input and output types, two additional methods are provided:
>> `withMultiOutputs()` which specifies the type of PCollection and
>> `withOutputType()` which specifies the type of output element. Some
>> examples are:
>>
>> PCollection<String> col =
>> testPipeline
>> .apply(Create.of("1", "2", "3"))
>> .apply(External.of(*...*));
>>
>> This is okay without additional methods since 1) input and output types
>> of external transform can be inferred 2) output PCollection is singular.
>>
>
> How does the type/coder at runtime get inferred (doesn't java's type
> erasure get rid of this information)?
>

>
>> PCollectionTuple pTuple =
>> testPipeline
>> .apply(Create.of(1, 2, 3, 4, 5, 6))
>> .apply(
>> External.of(*...*).withMultiOutputs());
>>
>> This requires `withMultiOutputs()` since output PCollection is
>> PCollectionTuple.
>>
>
> Shouldn't this require a mapping from "output" name to coder/type variable
> to be specified as an argument to withMultiOutputs?
>
>
>> PCollection<KV<String, Long>> pCol =
>> testPipeline
>> .apply(Create.of("1", "2", "2", "3", "3", "3"))
>> .apply(
>> External.of(...)
>> .<KV<String, Long>>withOutputType())
>> .apply(
>> "toString",
>> MapElements.into(TypeDescriptors.strings()).via(
>> x -> String.format("%s->%s", x.getKey(), x.getValue())));
>>
>>  This requires `withOutputType()` since the output element type cannot be
>> inferred from method chaining. I think some users may feel awkward to call
>> a method with only the type parameter and empty parentheses. Without
>> `withOutputType()`, the type of the output element will be java.lang.Object
>> which might still be forcefully cast to KV.
>>
>
> How does the output type get preserved in this case (since Java's type
> erasure would remove > after compilation and coder
> inference in my opinion should be broken and or choosing something generic
> like serializable)?
>

The expansion service is responsible for using cross-language compatible
coders in the returned expanded transforms, and these are the coders used
at runtime. The type information annotated by the additional methods here is
for compile-time type safety of external transforms.


>
>
>> Thanks,
>> Heejong
>>
>


External transform API in Java SDK

2019-12-19 Thread Heejong Lee
I wanted to know if anybody has any comment on external transform API for
Java SDK.

`External.of()` can create external transform for Java SDK. Depending on
input and output types, two additional methods are provided:
`withMultiOutputs()` which specifies the type of PCollection and
`withOutputType()` which specifies the type of output element. Some
examples are:

PCollection<String> col =
testPipeline
.apply(Create.of("1", "2", "3"))
.apply(External.of(*...*));

This is okay without additional methods since 1) input and output types of
external transform can be inferred 2) output PCollection is singular.

PCollectionTuple pTuple =
testPipeline
.apply(Create.of(1, 2, 3, 4, 5, 6))
.apply(
External.of(*...*).withMultiOutputs());

This requires `withMultiOutputs()` since output PCollection is
PCollectionTuple.

PCollection<KV<String, Long>> pCol =
testPipeline
.apply(Create.of("1", "2", "2", "3", "3", "3"))
.apply(
External.of(...)
.<KV<String, Long>>withOutputType())
.apply(
"toString",
MapElements.into(TypeDescriptors.strings()).via(
 x -> String.format("%s->%s", x.getKey(), x.getValue())));

 This requires `withOutputType()` since the output element type cannot be
inferred from method chaining. I think some users may feel awkward to call
a method with only the type parameter and empty parentheses. Without
`withOutputType()`, the type of the output element will be java.lang.Object
which might still be forcefully cast to KV.

Thanks,
Heejong


Re: Artifact staging in cross-language pipelines

2019-12-12 Thread Heejong Lee
I'm brushing up my memory by revisiting the doc[1], and it seems like we've
already reached consensus on the bigger picture. I'll start drafting
the implementation plan.

[1]:
https://docs.google.com/document/d/1XaiNekAY2sptuQRIXpjGAyaYdSc-wlJ-VKjl04c8N48/edit?usp=sharing

On Tue, Nov 26, 2019 at 3:54 AM Maximilian Michels  wrote:

> Hey Heejong,
>
> I don't think so. It would be great to push this forward.
>
> Thanks,
> Max
>
> On 26.11.19 02:49, Heejong Lee wrote:
> > Hi,
> >
> > Is anyone actively working on artifact staging extension for
> > cross-language pipelines? I'm thinking I can contribute to it in coming
> > Dec. If anyone has any progress on this and needs help, please let me
> know.
> >
> > Thanks,
> >
> > On Wed, Jun 12, 2019 at 2:42 AM Ismaël Mejía  > <mailto:ieme...@gmail.com>> wrote:
> >
> > Can you please add this to the design documents webpage.
> > https://beam.apache.org/contribute/design-documents/
> >
> > On Wed, May 8, 2019 at 7:29 PM Chamikara Jayalath
> > mailto:chamik...@google.com>> wrote:
> >  >
> >  >
> >  >
> >  > On Tue, May 7, 2019 at 10:21 AM Maximilian Michels
> > mailto:m...@apache.org>> wrote:
> >  >>
> >  >> Here's the first draft:
> >  >>
> >
> https://docs.google.com/document/d/1XaiNekAY2sptuQRIXpjGAyaYdSc-wlJ-VKjl04c8N48/edit?usp=sharing
> >  >>
> >  >> It's rather high-level. We may want to add more details once we
> have
> >  >> finalized the design. Feel free to make comments and edits.
> >  >
> >  >
> >  > Thanks Max. Added some comments.
> >  >
> >  >>
> >  >>
> >  >> > All of this goes back to the idea that I think the listing of
> >  >> > artifacts (or more general dependencies) should be a property
> > of the
> >  >> > environment themselves.
> >  >>
> >  >> +1 I came to the same conclusion while thinking about how to
> store
> >  >> artifact information for deferred execution of the pipeline.
> >  >>
> >  >> -Max
> >  >>
> >  >> On 07.05.19 18:10, Robert Bradshaw wrote:
> >  >> > Looking forward to your writeup, Max. In the meantime, some
> > comments below.
> >  >> >
> >  >> >
> >  >> > From: Lukasz Cwik mailto:lc...@google.com>>
> >  >> > Date: Thu, May 2, 2019 at 6:45 PM
> >  >> > To: dev
> >  >> >
> >  >> >>
> >  >> >>
> >  >> >> On Thu, May 2, 2019 at 7:20 AM Robert Bradshaw
> > mailto:rober...@google.com>> wrote:
> >  >> >>>
> >  >> >>> On Sat, Apr 27, 2019 at 1:14 AM Lukasz Cwik
> > mailto:lc...@google.com>> wrote:
> >  >> >>>>
> >  >> >>>> We should stick with URN + payload + artifact metadata[1]
> > where the only mandatory one that all SDKs and expansion services
> > understand is the "bytes" artifact type. This allows us to add
> > optional URNs for file://, http://, Maven, PyPi, ... in the future.
> > I would make the artifact staging service use the same URN + payload
> > mechanism to get compatibility of artifacts across the different
> > services and also have the artifact staging service be able to be
> > queried for the list of artifact types it supports.
> >  >> >>>
> >  >> >>> +1
> >  >> >>>
> >  >> >>>> Finally, we would need to have environments enumerate the
> > artifact types that they support.
> >  >> >>>
> >  >> >>> Meaning at runtime, or as another field statically set in
> > the proto?
> >  >> >>
> >  >> >>
> >  >> >> I don't believe runners/SDKs should have to know what
> > artifacts each environment supports at runtime and instead have
> > environments enumerate them explicitly in the proto. I have been
> > thinking about a more general "capabilities" block on environments
> > which allow them to enumerate URNs that the environment understands.
> > This would include artifact type URNs, PTransform URNs, coder URNs,
> 

Revamping the cross-language validate runner test suite

2019-11-08 Thread Heejong Lee
Hi,

I'm working on revamping the cross-language validate runner test suite. Our
current test suite for the cross-language transform is incomplete as it
only has tests for Wordcount, DoFn, basic Count and basic Filter
transforms. My plan is, in addition to our existing set of tests, to add
all primitive transforms to the suite like GroupByKey, CoGroupByKey,
Combine, Flatten and Partition.

The link for the design doc is attached. Please feel free to comment.

https://docs.google.com/document/d/1xQp0ElIV84b8OCVz8CD2hvbiWdR8w4BvWxPTZJZA6NA/edit?usp=sharing


Re: published containers overwrite locally built containers

2019-11-06 Thread Heejong Lee
I think that implicitly (and forcefully) pulling the remote image is not good
even in the case of a bug fix. The better approach would be releasing a
separate bug-fix version. Implicitly pulling an updated version of the
same container looks weird to me since it feels like releasing a jar
artifact with the same version multiple times or publishing an already
published git branch again. However, I understand it's much easier to just
update the container with the same tag than to release another Beam version.

On Wed, Nov 6, 2019 at 8:05 AM Valentyn Tymofieiev 
wrote:

> I agree with the resolutions in the link Thomas mentioned [1]. Using
> latest tag is not reliable, and a unique tag ID should be generated when
> running tests on Jenkins against master branch.
> I think pulling the latest image for the current tag is actually a desired
> behavior, in case the external image was updated (due to a bug fix for
> example). Our custom container documentation should reflect this behavior.
> Consider continuing the conversation in [1] to keep it in one place if
> there are other suggestions/opinions.
>
> [1]
> https://lists.apache.org/thread.html/07131e314e229ec60100eaa2c0cf6dfc206bf2b0f78c3cee9ebb0bda@%3Cdev.beam.apache.org%3E
>
>
> On Fri, Nov 1, 2019 at 5:14 PM Thomas Weise  wrote:
>
>> More here:
>> https://lists.apache.org/thread.html/07131e314e229ec60100eaa2c0cf6dfc206bf2b0f78c3cee9ebb0bda@%3Cdev.beam.apache.org%3E
>>
>>
>> On Fri, Nov 1, 2019 at 10:56 AM Chamikara Jayalath 
>> wrote:
>>
>>> I think it makes sense to override published docker images with locally
>>> built versions when testing HEAD.
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Thu, Oct 31, 2019 at 6:31 PM Heejong Lee  wrote:
>>>
>>>> Hi, happy halloween!
>>>>
>>>> I'm looking into failing cross language post commit tests:
>>>> https://issues.apache.org/jira/browse/BEAM-8534
>>>> <https://issues.apache.org/jira/browse/BEAM-8534?filter=-1>
>>>>
>>>> After a few runs, I've found that published SDK harness containers
>>>> overwrite locally built containers when docker pull happens. I can think of
>>>> two possible solutions here: 1) remove the published images with the latest
>>>> tag, so that the image with the latest tag is available for testing and
>>>> development purposes. 2) put serialVersionUID to the class printing out the
>>>> error.
>>>>
>>>> 2) doesn't sound like a fundamental solution if we're not going to
>>>> attach serialVersionUID to all serializable classes. 1) might work but I'm
>>>> not sure whether there's another use for the latest tag somewhere. Any
>>>> better ideas?
>>>>
>>>> Thanks,
>>>> Heejong
>>>>
>>>


Re: published containers overwrite locally built containers

2019-11-01 Thread Heejong Lee
Since 'docker run' automatically pulls when the image doesn't exist
locally, I think it's safe to remove the explicit 'docker pull' before 'docker
run'. Without 'docker pull', we won't update the local image with the
remote image (for the same tag), but that shouldn't be a problem in prod,
where a unique tag is assumed for each released version.

On Fri, Nov 1, 2019 at 10:56 AM Chamikara Jayalath 
wrote:

> I think it makes sense to override published docker images with locally
> built versions when testing HEAD.
>
> Thanks,
> Cham
>
> On Thu, Oct 31, 2019 at 6:31 PM Heejong Lee  wrote:
>
>> Hi, happy halloween!
>>
>> I'm looking into failing cross language post commit tests:
>> https://issues.apache.org/jira/browse/BEAM-8534
>> <https://issues.apache.org/jira/browse/BEAM-8534?filter=-1>
>>
>> After a few runs, I've found that published SDK harness containers
>> overwrite locally built containers when docker pull happens. I can think of
>> two possible solutions here: 1) remove the published images with the latest
>> tag, so make the image with the latest tag available for testing and
>> development purposes. 2) put serialVersionUID to the class printing out the
>> error.
>>
>> 2) doesn't sound like a fundamental solution if we're not going to attach
>> serialVersionUID to all serializable classes. 1) might work but I'm not
>> sure whether there's another use for the latest tag somewhere. Any better
>> ideas?
>>
>> Thanks,
>> Heejong
>>
>


published containers overwrite locally built containers

2019-10-31 Thread Heejong Lee
Hi, happy halloween!

I'm looking into failing cross language post commit tests:
https://issues.apache.org/jira/browse/BEAM-8534


After a few runs, I've found that published SDK harness containers
overwrite locally built containers when docker pull happens. I can think of
two possible solutions here: 1) remove the published images with the latest
tag, so that the image with the latest tag is available for testing and
development purposes. 2) put serialVersionUID to the class printing out the
error.

2) doesn't sound like a fundamental solution if we're not going to attach
serialVersionUID to all serializable classes. 1) might work but I'm not
sure whether there's another use for the latest tag somewhere. Any better
ideas?

Thanks,
Heejong


Re: [ANNOUNCE] New committer: Valentyn Tymofieiev

2019-08-26 Thread Heejong Lee
Congratulations! :)

On Mon, Aug 26, 2019 at 2:44 PM Rui Wang  wrote:

> Congratulations!
>
>
> -Rui
>
> On Mon, Aug 26, 2019 at 2:36 PM Hannah Jiang 
> wrote:
>
>> Congratulations Valentyn, well deserved!
>>
>> On Mon, Aug 26, 2019 at 2:34 PM Chamikara Jayalath 
>> wrote:
>>
>>> Congrats Valentyn!
>>>
>>> On Mon, Aug 26, 2019 at 2:32 PM Pablo Estrada 
>>> wrote:
>>>
 Thanks Valentyn!

 On Mon, Aug 26, 2019 at 2:29 PM Robin Qiu  wrote:

> Thank you Valentyn! Congratulations!
>
> On Mon, Aug 26, 2019 at 2:28 PM Robert Bradshaw 
> wrote:
>
>> Hi,
>>
>> Please join me and the rest of the Beam PMC in welcoming a new
>> committer: Valentyn Tymofieiev
>>
>> Valentyn has made numerous contributions to Beam over the last several
>> years (including 100+ pull requests), most recently pushing through
>> the effort to make Beam compatible with Python 3. He is also an active
>> participant in design discussions on the list, participates in release
>> candidate validation, and proactively helps keep our tests green.
>>
>> In consideration of Valentyn's contributions, the Beam PMC trusts him
>> with the responsibilities of a Beam committer [1].
>>
>> Thank you, Valentyn, for your contributions and looking forward to
>> many more!
>>
>> Robert, on behalf of the Apache Beam PMC
>>
>> [1]
>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>
>


Re: [ANNOUNCE] New committer: Kyle Weaver

2019-08-07 Thread Heejong Lee
Congratulations!

On Wed, Aug 7, 2019 at 11:05 AM Tanay Tummalapalli 
wrote:

> Congratulations!
>
> On Wed, Aug 7, 2019 at 11:27 PM Robin Qiu  wrote:
>
>> Congratulations, Kyle!
>>
>> On Wed, Aug 7, 2019 at 5:04 AM Valentyn Tymofieiev 
>> wrote:
>>
>>> Congrats, Kyle!
>>>
>>> On Wed, Aug 7, 2019 at 1:01 PM Ismaël Mejía  wrote:
>>>
 Congrats Kyle, well deserved :clap: !

 On Wed, Aug 7, 2019, 11:22 AM Gleb Kanterov  wrote:

> Congratulations!
>
> On Wed, Aug 7, 2019 at 7:01 AM Connell O'Callaghan <
> conne...@google.com> wrote:
>
>> Well done congratulations Kyle!!!
>>
>> On Tue, Aug 6, 2019 at 21:58 Thomas Weise  wrote:
>>
>>> Congrats!
>>>
>>> On Tue, Aug 6, 2019, 7:24 PM Reza Rokni  wrote:
>>>
 Congratz!

 On Wed, 7 Aug 2019 at 06:40, Chamikara Jayalath <
 chamik...@google.com> wrote:

> Congrats!!
>
> On Tue, Aug 6, 2019 at 3:33 PM Udi Meiri  wrote:
>
>> Congrats Kyle!
>>
>> On Tue, Aug 6, 2019 at 2:00 PM Melissa Pashniak <
>> meliss...@google.com> wrote:
>>
>>> Congratulations Kyle!
>>>
>>> On Tue, Aug 6, 2019 at 1:36 PM Yichi Zhang 
>>> wrote:
>>>
 Congrats Kyle!

 On Tue, Aug 6, 2019 at 1:29 PM Aizhamal Nurmamat kyzy <
 aizha...@google.com> wrote:

> Thank you, Kyle! And congratulations :)
>
> On Tue, Aug 6, 2019 at 10:09 AM Hannah Jiang <
> hannahji...@google.com> wrote:
>
>> Congrats Kyle!
>>
>> On Tue, Aug 6, 2019 at 9:52 AM David Morávek <
>> david.mora...@gmail.com> wrote:
>>
>>> Congratulations Kyle!!
>>>
>>> Sent from my iPhone
>>>
>>> On 6 Aug 2019, at 18:47, Anton Kedin 
>>> wrote:
>>>
>>> Congrats!
>>>
>>> On Tue, Aug 6, 2019, 9:37 AM Ankur Goenka 
>>> wrote:
>>>
 Congratulations Kyle!

 On Tue, Aug 6, 2019 at 9:35 AM Ahmet Altay <
 al...@google.com> wrote:

> Hi,
>
> Please join me and the rest of the Beam PMC in welcoming a
> new committer: Kyle Weaver.
>
> Kyle has been contributing to Beam for a while now. And in
> that time period Kyle got the portable spark runner feature 
> complete for
> batch processing. [1]
>
> In consideration of Kyle's contributions, the Beam PMC
> trusts him with the responsibilities of a Beam committer
>  [2].
>
> Thank you, Kyle, for your contributions and looking
> forward to many more!
>
> Ahmet, on behalf of the Apache Beam PMC
>
> [1]
> https://lists.apache.org/thread.html/c43678fc24c9a1dc9f48c51c51950aedcb9bc0fd3b633df16c3d595a@%3Cuser.beam.apache.org%3E
> [2] https://beam.apache.org/contribute/become-a-committer
> /#an-apache-beam-committer
>


 --

 This email may be confidential and privileged. If you received this
 communication by mistake, please don't forward it to anyone else, 
 please
 erase all copies and attachments, and please let me know that it has 
 gone
 to the wrong person.

 The above terms reflect a potential business arrangement, are
 provided solely as a basis for further discussion, and are not 
 intended to
 be and do not constitute a legally binding obligation. No legally 
 binding
 obligations will be created, implied, or inferred until an agreement in
 final form is executed in writing by all parties involved.

>>>
>
> --
> Cheers,
> Gleb
>



Re: How to expose/use the External transform on Java SDK

2019-07-24 Thread Heejong Lee
I think it depends on how we define "the core" part of the SDK. If we define
the core as only the (abstract) data types which describe the Beam pipeline
model, then it would be more sensible to put the external transform into a
separate extension module (option 4). Otherwise, option 1 makes sense.

On Wed, Jul 24, 2019 at 11:56 AM Chamikara Jayalath 
wrote:

> The idea of 'ExternalTransform' is to allow users to use transforms in SDK
> X from SDK Y. I think this should be a core part of each SDK and
> corresponding external transforms ([a] for Java, [b] for Python) should be
> released with each SDK. This will also allow us to add core external
> transforms to some of the critical transforms that are not available in
> certain SDKs. So I prefer option (1).
>
> Rebo, I didn't realize there's an external transform in Go SDK. Looking at
> it, seems like it's more of an interface for native transforms implemented
> in each runner, not for cross-language use-cases. Is that correct ? May be
> we can reuse it for latter as well.
>
> Thanks,
> Cham
>
> [a]
> https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/External.java
> [b]
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/external.py
>
> On Wed, Jul 24, 2019 at 10:25 AM Robert Burke  wrote:
>
>> Ideas inline.
>>
>> On Wed, Jul 24, 2019, 9:56 AM Ismaël Mejía  wrote:
>>
>>> After Beam Summit EU I was curious about the External transform. I was
>>> interested in the scenario of using it to call Python code in the
>>> middle of a Java pipeline. This is a potentially useful scenario for
>>> example to evaluate models from python ML frameworks on Java
>>> pipelines. In my example I did a transform to classify elements in a
>>> simple Python ParDo and tried to connect it via the Java External
>>> transform.
>>>
>>> I found that the ExternalTransform code was added into
>>> `runners/core-construction-java` as part of BEAM-6747 [1]. However
>>> this code is not exposed currently as part of the Beam Java SDK, so
>>> end users won’t be able to find it easily. I found this weird and
>>> thought well it will be as simple as to move it into the Java SDK and
>>> voila!
>>>
>>> But of course this could not be so easy because this transform calls
>>> the Expansion service via gRPC and Java SDK does not have (and
>>> probably should not have) gRPC in its dependencies.
>>> So my second reflex was to add it into the Java SDK and translate it to a
>>> generic expansion in all the runners, but this may not make sense because
>>> the External transform is not part of the runner translation since
>>> this is part of the Pipeline construction process (as pointed to me by
>>> Max in a slack discussion).
>>>
>>> So the question is: How do you think this should be exposed to the end
>>> users?
>>>
>>> 1. Should we add gRPC with all its deps to SDKs Java core? (this of
>>> course it is not nice because we will leak our vendored gRPC and
>>> friends into users classpath).
>>>
>> If there's separation between the SDK and the Harness then this makes
>> sense. Otherwise the portable harness depends on GRPC at present, doesn't
>> it? Presently the Go SDK kicks off the harness, and then carries the GRPC
>> dependency (Though that's separable if necessary.)
>>
>>> 2. Should we do the dynamic loading of classes only an runtime if the
>>> transform is used to avoid the big extra compile dependency (and add
>>> runners/core-construction-java) as a runtime dependency.
>>> 3. Should we create a ‘shim’ module to hide the gRPC dependency and
>>> load the gRPC classes dynamically on it when the External transform is
>>> part of the pipeline.
>>> 4. Should we pack it as an extension (with the same issue of needing
>>> to leak the dependencies, but with less impact for users who do not
>>> use External) ?
>>> 5. Other?
>>>
>>> The ‘purist’ me thinks we should have External in sdks/java/core but
>>> maybe it is better not to. Any other opinions or ideas?
>>>
>>
>> The Go SDK supports External in it's core transforms set  However it
>> would be the callers are able to populate the data field however they need
>> to, whether that's some "known" configuration object or something sourced
>> from another service (eg the expansion service). The important part on the
>> other side is that the runner knows what to do with it.
>>
>> The non-portable pubsubio in the Go SDK is an example [1] using External
>> currently. The Dataflow runner recognizes it, and makes the substitution.
>> Eventually once the SDK supports SDF that can generate unbounded
>> PCollections, this will likely be replaced with that kind of
>> implementation, and the the existing "External" version will be moved to
>> part of the Go SDKs Dataflow runner package.
>>
>>
>> [1]
>> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/io/pubsubio/pubsubio.go#L65
>>
>>>
>>> Thanks,
>>> Ismaël
>>>
>>> [1] https://issues.apache.org/jira/browse/BEAM-6747
>>>
>>

Re: [ANNOUNCE] New committer: Robert Burke

2019-07-16 Thread Heejong Lee
Congratulations!

On Tue, Jul 16, 2019 at 1:34 PM Chamikara Jayalath 
wrote:

> Congrats!!
>
> On Tue, Jul 16, 2019 at 1:31 PM Robin Qiu  wrote:
>
>> Congrats, Robert!!
>>
>> On Tue, Jul 16, 2019 at 1:22 PM Alan Myrvold  wrote:
>>
>>> Congrats, Robert!
>>>
>>> On Tue, Jul 16, 2019 at 11:46 AM Ismaël Mejía  wrote:
>>>
 Congrats Robert!


 On Tue, Jul 16, 2019 at 8:19 PM Yichi Zhang  wrote:
 >
 > Congratulations!
 >
 > On Tue, Jul 16, 2019 at 10:51 AM Holden Karau 
 wrote:
 >>
 >> Congratulations! :)
 >>
 >> On Tue, Jul 16, 2019 at 10:50 AM Mikhail Gryzykhin <
 mig...@google.com> wrote:
 >>>
 >>> Congratulations!
 >>>
 >>> On Tue, Jul 16, 2019 at 10:36 AM Ankur Goenka 
 wrote:
 
  Congratulations Robert!
 
  Go GO!
 
  On Tue, Jul 16, 2019 at 10:34 AM Rui Wang 
 wrote:
 >
 > Congrats!
 >
 >
 > -Rui
 >
 > On Tue, Jul 16, 2019 at 10:32 AM Udi Meiri 
 wrote:
 >>
 >> Congrats Robert B.!
 >>
 >> On Tue, Jul 16, 2019 at 10:23 AM Ahmet Altay 
 wrote:
 >>>
 >>> Hi,
 >>>
 >>> Please join me and the rest of the Beam PMC in welcoming a new
 committer: Robert Burke.
 >>>
 >>> Robert has been contributing to Beam and actively involved in
 the community for over a year. He has been actively working on Go SDK,
 helping users, and making it easier for others to contribute [1].
 >>>
 >>> In consideration of Robert's contributions, the Beam PMC trusts
 him with the responsibilities of a Beam committer [2].
 >>>
 >>> Thank you, Robert, for your contributions and looking forward
 to many more!
 >>>
 >>> Ahmet, on behalf of the Apache Beam PMC
 >>>
 >>> [1]
 https://lists.apache.org/thread.html/8f729da2d3009059d7a8b2d8624446be161700dcfa953939dd3530c6@%3Cdev.beam.apache.org%3E
 >>> [2]
 https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
 >>
 >>
 >>
 >> --
 >> Twitter: https://twitter.com/holdenkarau
 >> Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9
 >> YouTube Live Streams: https://www.youtube.com/user/holdenkarau

>>>


Re: [ANNOUNCE] New PMC Member: Pablo Estrada

2019-05-15 Thread Heejong Lee
Congratulations!

On Wed, May 15, 2019 at 12:24 PM Niklas Hansson <
niklas.sven.hans...@gmail.com> wrote:

> Congratulations Pablo :)
>
> Den ons 15 maj 2019 kl 21:21 skrev Ruoyun Huang :
>
>> Congratulations, Pablo!
>>
>> *From: *Charles Chen 
>> *Date: *Wed, May 15, 2019 at 11:04 AM
>> *To: *dev
>>
>> Congrats Pablo and thank you for your contributions!
>>>
>>> On Wed, May 15, 2019, 10:53 AM Valentyn Tymofieiev 
>>> wrote:
>>>
 Congrats, Pablo!

 On Wed, May 15, 2019 at 10:41 AM Yifan Zou  wrote:

> Congratulations, Pablo!
>
> *From: *Maximilian Michels 
> *Date: *Wed, May 15, 2019 at 2:06 AM
> *To: * 
>
> Congrats Pablo! Thank you for your help to grow the Beam community!
>>
>> On 15.05.19 10:33, Tim Robertson wrote:
>> > Congratulations Pablo
>> >
>> > On Wed, May 15, 2019 at 10:22 AM Ismaël Mejía > > > wrote:
>> >
>> > Congrats Pablo, well deserved, nece to see your work recognized!
>> >
>> > On Wed, May 15, 2019 at 9:59 AM Pei HE > > > wrote:
>> >  >
>> >  > Congrats, Pablo!
>> >  >
>> >  > On Tue, May 14, 2019 at 11:41 PM Tanay Tummalapalli
>> >  > mailto:ttanay.apa...@gmail.com>>
>> wrote:
>> >  > >
>> >  > > Congratulations Pablo!
>> >  > >
>> >  > > On Wed, May 15, 2019, 12:08 Michael Luckey <
>> adude3...@gmail.com
>> > > wrote:
>> >  > >>
>> >  > >> Congrats, Pablo!
>> >  > >>
>> >  > >> On Wed, May 15, 2019 at 8:21 AM Connell O'Callaghan
>> > mailto:conne...@google.com>> wrote:
>> >  > >>>
>> >  > >>> Awesome well done Pablo!!!
>> >  > >>>
>> >  > >>> Kenn thank you for sharing this great news with us!!!
>> >  > >>>
>> >  > >>> On Tue, May 14, 2019 at 11:01 PM Ahmet Altay
>> > mailto:al...@google.com>> wrote:
>> >  > 
>> >  >  Congratulations!
>> >  > 
>> >  >  On Tue, May 14, 2019 at 9:11 PM Robert Burke
>> > mailto:rob...@frantil.com>> wrote:
>> >  > >
>> >  > > Woohoo! Well deserved.
>> >  > >
>> >  > > On Tue, May 14, 2019, 8:34 PM Reuven Lax <
>> re...@google.com
>> > > wrote:
>> >  > >>
>> >  > >> Congratulations!
>> >  > >>
>> >  > >> From: Mikhail Gryzykhin > > >
>> >  > >> Date: Tue, May 14, 2019 at 8:32 PM
>> >  > >> To: mailto:dev@beam.apache.org
>> >>
>> >  > >>
>> >  > >>> Congratulations Pablo!
>> >  > >>>
>> >  > >>> On Tue, May 14, 2019, 20:25 Kenneth Knowles
>> > mailto:k...@apache.org>> wrote:
>> >  > 
>> >  >  Hi all,
>> >  > 
>> >  >  Please join me and the rest of the Beam PMC in
>> welcoming
>> > Pablo Estrada to join the PMC.
>> >  > 
>> >  >  Pablo first picked up BEAM-722 in October of 2016
>> and
>> > has been a steady part of the Beam community since then. In
>> addition
>> > to technical work on Beam Python & Java & runners, I would
>> highlight
>> > how Pablo grows Beam's community by helping users, working on
>> GSoC,
>> > giving talks at Beam Summits and other OSS conferences including
>> > Flink Forward, and holding training workshops. I cannot do
>> justice
>> > to Pablo's contributions in a single paragraph.
>> >  > 
>> >  >  Thanks Pablo, for being a part of Beam.
>> >  > 
>> >  >  Kenn
>> >
>>
>
>>
>> --
>> 
>> Ruoyun  Huang
>>
>>


Re: [ANNOUNCE] New committer announcement: Udi Meiri

2019-05-03 Thread Heejong Lee
Congratulations!

On Fri, May 3, 2019 at 3:53 PM Reza Rokni  wrote:

> Congratulations !
>
> *From: *Reuven Lax 
> *Date: *Sat, 4 May 2019, 06:42
> *To: *dev
>
> Thank you!
>>
>> On Fri, May 3, 2019 at 3:15 PM Ankur Goenka  wrote:
>>
>>> Congratulations Udi!
>>>
>>> On Fri, May 3, 2019 at 3:00 PM Connell O'Callaghan 
>>> wrote:
>>>
 Well done Udi!!! Congratulations and thank you for your
 contributions!!!

 Kenn thank you for sharing!!!

 On Fri, May 3, 2019 at 2:49 PM Yifan Zou  wrote:

> Thanks Udi and congratulations!
>
> On Fri, May 3, 2019 at 2:47 PM Robin Qiu  wrote:
>
>> Congratulations Udi!!!
>>
>> *From: *Ruoyun Huang 
>> *Date: *Fri, May 3, 2019 at 2:39 PM
>> *To: * 
>>
>> Congratulations Udi!
>>>
>>> On Fri, May 3, 2019 at 2:30 PM Ahmet Altay  wrote:
>>>
 Congratulations, Udi!

 *From: *Kyle Weaver 
 *Date: *Fri, May 3, 2019 at 2:11 PM
 *To: * 

 Congratulations Udi! I look forward to sending you all my reviews
> for
> the next month (just kidding :)
>
> Kyle Weaver | Software Engineer | github.com/ibzib |
> kcwea...@google.com | +1650203
>
> On Fri, May 3, 2019 at 1:52 PM Charles Chen 
> wrote:
> >
> > Thank you Udi!
> >
> > On Fri, May 3, 2019, 1:51 PM Aizhamal Nurmamat kyzy <
> aizha...@google.com> wrote:
> >>
> >> Congratulations, Udi! Thank you for all your contributions!!!
> >>
> >> From: Pablo Estrada 
> >> Date: Fri, May 3, 2019 at 1:45 PM
> >> To: dev
> >>
> >>> Thanks Udi and congrats!
> >>>
> >>> On Fri, May 3, 2019 at 1:44 PM Kenneth Knowles <
> k...@apache.org> wrote:
> 
>  Hi all,
> 
>  Please join me and the rest of the Beam PMC in welcoming a
> new committer: Udi Meiri.
> 
>  Udi has been contributing to Beam since late 2017, starting
> with HDFS support in the Python SDK and continuing with a ton of 
> Python
> work. I also will highlight his work on community-building 
> infrastructure,
> including documentation, experiments with ways to find reviewers for 
> pull
> requests, gradle build work, analyzing and reducing build times.
> 
>  In consideration of Udi's contributions, the Beam PMC trusts
> Udi with the responsibilities of a Beam committer [1].
> 
>  Thank you, Udi, for your contributions.
> 
>  Kenn
> 
>  [1]
> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>

>>>
>>> --
>>> 
>>> Ruoyun  Huang
>>>
>>>


Re: Artifact staging in cross-language pipelines

2019-04-23 Thread Heejong Lee
On Tue, Apr 23, 2019 at 2:07 AM Robert Bradshaw wrote:

> I've been out, so coming a bit late to the discussion, but here's my
> thoughts.
>
> The expansion service absolutely needs to be able to provide the
> dependencies for the transform(s) it expands. It seems the default,
> foolproof way of doing this is via the environment, which can be a
> docker image with all the required dependencies. More than this an
> (arguably important, but possibly messy) optimization.
>
> The standard way to provide artifacts outside of the environment is
> via the artifact staging service. Of course, the expansion service may
> not have access to the (final) artifact staging service (due to
> permissions, locality, or it may not even be started up yet) but the
> SDK invoking the expansion service could offer an artifact staging
> environment for the SDK to publish artifacts to. However, there are
> some difficulties here, in particular avoiding name collisions with
> staged artifacts, assigning semantic meaning to the artifacts (e.g.
> should jar files get automatically placed in the classpath, or Python
> packages recognized and installed at startup). The alternative is
> going with a (type, pointer) scheme for naming dependencies; if we go
> this route I think we should consider migrating all artifact staging
> to this style. I am concerned that the "file" version will be less
> than useful for what will become the most convenient expansion
> services (namely, hosted and docker image). I am still at a loss,
> however, as to how to solve the diamond dependency problem among
> dependencies--perhaps the information is there if one walks
> maven/pypi/go modules/... but do we expect every runner to know about
> every packaging platform? This also wouldn't solve the issue if fat
> jars are used as dependencies. The only safe thing to do here is to
> force distinct dependency sets to live in different environments,
> which could be too conservative.
>
> This all leads me to think that perhaps the environment itself should
> be docker image (often one of "vanilla" beam-java-x.y ones) +
> dependency list, rather than have the dependency/artifact list as some
> kind of data off to the side. In this case, the runner would (as
> requested by its configuration) be free to merge environments it
> deemed compatible, including swapping out beam-java-X for
> beam-java-embedded if it considers itself compatible with the
> dependency list.


I like this idea of building multiple docker environments on top of a bare
minimum SDK harness container and allowing runners to pick a suitable one
based on a dependency list.
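
To sketch the shape being proposed here (plain Python, not the actual Beam
proto; the URN string is illustrative only): an environment pairs a base
image with an explicit dependency list that a runner may inspect, merge, or
swap:

```
from dataclasses import dataclass, field
from typing import List

@dataclass
class Dependency:
    type_urn: str   # e.g. "beam:artifact:type:file:v1" (illustrative)
    payload: bytes  # type-specific pointer: a path, URL, Maven coordinate...

@dataclass
class Environment:
    # A "vanilla" SDK harness image plus an explicit dependency list; a
    # runner that deems the list compatible may merge environments or
    # swap the image for an embedded one.
    container_image: str
    dependencies: List[Dependency] = field(default_factory=list)
```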


>
> I agree with Thomas that we'll want to make expansion services, and
> the transforms they offer, more discoverable. The whole lifetime cycle
> of expansion services is something that has yet to be fully fleshed
> out, and may influence some of these decisions.
>
> As for adding --jar_package to the Python SDK, this seems really
> specific to calling java-from-python (would we have O(n^2) such
> options?) as well as out-of-place for a Python user to specify. I
> would really hope we can figure out a more generic solution. If we
> need this option in the meantime, let's at least make it clear
> (probably in the name) that it's temporary.
>

Good points. I second that we need a more generic solution than a
Python-to-Java specific option. In the meantime, instead of naming it
differently, I think we can make --jar_package a secondary option under
--experiment. WDYT?


> On Tue, Apr 23, 2019 at 1:08 AM Thomas Weise  wrote:
> >
> > One more suggestion:
> >
> > It would be nice to be able to select the environment for the external
> transforms. For example, I would like to be able to use EMBEDDED for Flink.
> That's implicit for sources which are runner native unbounded read
> translations, but it should also be possible for writes. That would then be
> similar to how pipelines are packaged and run with the "legacy" runner.
> >
> > Thomas
> >
> >
> > On Mon, Apr 22, 2019 at 1:18 PM Ankur Goenka  wrote:
> >>
> >> Great discussion!
> >> I have a few points around the structure of proto but that is less
> important as it can evolve.
> >> However, I think that artifact compatibility is another important
> aspect to look at.
> >> Example: TransformA uses Guava >1.6, <1.7; TransformB uses >1.8, <1.9; and
> TransformC uses >1.6, <1.8. As the SDK provides the environment for each
> transform, it cannot simply say EnvironmentJava for both TransformA and
> TransformB, as the dependencies are not compatible.
> >> We should have separate environment associated with TransformA and
> TransformB in this case.
> >>
> >> To support this case, we need 2 things.
> >> 1: Granular metadata about the dependency including type.
> >> 2: Complete list of the transforms to be expanded.
> >>
> >> Elaboration:
> >> The compatibility check can be done in a crude way if we provide all
> the metadata about the dependency to expansion service.
> >> Also, the expansion service should expand 

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-04 Thread Heejong Lee
>>>>> Ah great. Thanks for the pointer. Any idea why there's a separate
> copy for Python ? I didn't see a significant difference in definitions
> looking at few random coders there but I might have missed something. If
> there's no reason to maintain two, we should probably unify.
> >>>>> Also, seems like we haven't added the definition for UTF-8 coder yet.
> >>>>>
> >>>>
> >>>> Not certain as well. I did notice the timer coder definition didn't
> exist in the Python copy.
> >>>>
> >>>>>>
> >>>>>>
> >>>>>> Here is an example PR[3] that adds the "beam:coder:double:v1" as
> tests to the Java and Python SDKs to ensure interoperability.
> >>>>>>
> >>>>>> Robert Burke, does the Go SDK have a test where it uses
> standard_coders.yaml and runs compatibility tests?
> >>>>>>
> >>>>>> Chamikara, creating new coder classes is a pain since the type ->
> coder mapping per SDK language would select the non-well known type if we
> added a new one to a language. If we swapped the default type->coder
> mapping, this would still break update for pipelines forcing users to
> update their code to select the non-well known type. If we don't change the
> default type->coder mapping, the well known coder will gain little usage. I
> think we should fix the Python coder to use the same encoding as Java for
> UTF-8 strings before there are too many Python SDK users.
> >>>>>
> >>>>>
> >>>>> I was thinking that may be we should just change the default UTF-8
> coder for Fn API path which is experimental. Updating Python to do what's
> done for Java is fine if we agree that encoding used for Java should be the
> standard.
> >>>>>
> >>>>
> >>>> That is a good idea to use the Fn API experiment to control which
> gets selected.
> >>>>
> >>>>>>
> >>>>>>
> >>>>>> 1:
> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
> >>>>>> 2:
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/data/standard_coders.yaml
> >>>>>> 3: https://github.com/apache/beam/pull/8205
> >>>>>>
> >>>>>> On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath <
> chamik...@google.com> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw <
> rober...@google.com> wrote:
> >>>>>>>>
> >>>>>>>> A URN defines the encoding.
> >>>>>>>>
> >>>>>>>> There are (unfortunately) *two* encodings defined for a Coder
> (defined
> >>>>>>>> by a URN), the nested and the unnested one. IIRC, in both Java and
> >>>>>>>> Python, the nested one prefixes with a var-int length, and the
> >>>>>>>> unnested one does not.
> >>>>>>>
> >>>>>>>
> >>>>>>> Could you clarify where we define the exact encoding ? I only see
> a URN for UTF-8 [1] while if you look at the implementations Java includes
> length in the encoding [2] while Python [3] does not.
> >>>>>>>
> >>>>>>> [1]
> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563
> >>>>>>> [2]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50
> >>>>>>> [3]
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321
> >>>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> We should define the spec clearly and have cross-language tests.
> >>>>>>>
> >>>>>>>
> >>>>>>> +1
> >>>>>>>
> >>>>>>> Regarding backwards compatibility, I agree that we should probably
> not update existing coder classes. Probably we should just standardize the
> correct encoding (may be as a comment near corresponding URN in the
> beam_runner_api.proto ?)

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-04 Thread Heejong Lee
On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath 
wrote:

>
>
> On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw 
> wrote:
>
>> A URN defines the encoding.
>>
>> There are (unfortunately) *two* encodings defined for a Coder (defined
>> by a URN), the nested and the unnested one. IIRC, in both Java and
>> Python, the nested one prefixes with a var-int length, and the
>> unnested one does not.
>>
>
> Could you clarify where we define the exact encoding ? I only see a URN
> for UTF-8 [1] while if you look at the implementations Java includes length
> in the encoding [2] while Python [3] does not.
>
> [1]
> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563
> [2]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50
> [3]
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321
>
>
>
>>
>> We should define the spec clearly and have cross-language tests.
>>
>
> +1
>
> Regarding backwards compatibility, I agree that we should probably not
> update existing coder classes. Probably we should just standardize the
> correct encoding (may be as a comment near corresponding URN in the
> beam_runner_api.proto ?) and create new coder classes as needed.
>

Then how do we pair the type and the coder? For Java we can explicitly
assign a specific coder to a PCollection, but for Python a coder is inferred
from the element type of the PCollection. If we create another standard coder
for UTF-8 strings, would that new coder be the default for the string element
type?
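
To make the discrepancy concrete, here is a small self-contained sketch in
plain Python (not Beam code) of the two byte layouts under discussion,
assuming the nested context where Java's StringUtf8Coder prefixes a var-int
length:

```
def _varint(n):
    # Variable-length integer encoding used by Beam for length prefixes.
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def java_style_utf8(s):
    # Java StringUtf8Coder (nested form): var-int byte length, then UTF-8.
    data = s.encode('utf-8')
    return _varint(len(data)) + data

def python_style_utf8(s):
    # Python StrUtf8Coder: raw UTF-8 bytes, no length prefix.
    return s.encode('utf-8')

assert java_style_utf8(u'beam') == b'\x04beam'
assert python_style_utf8(u'beam') == b'beam'
```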


>
>
>>
>> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada  wrote:
>> >
>> > Could this be a backwards-incompatible change that would break
>> pipelines from upgrading? If they have data in-flight in between operators,
>> and we change the coder, they would break?
>> > I know very little about coders, but since nobody has mentioned it, I
>> wanted to make sure we have it in mind.
>> > -P.
>> >
>> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles  wrote:
>> >>
>> >> Agree that a coder URN defines the encoding. I see that string UTF-8
>> was added to the proto enum, but it needs a written spec of the encoding.
>> Ideally some test data that different languages can use to drive compliance
>> testing.
>> >>
>> >> Kenn
>> >>
>> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke 
>> wrote:
>> >>>
>> >>> String UTF8 was recently added as a "standard coder " URN in the
>> protos, but I don't think that developed beyond Java, so adding it to
>> Python would be reasonable in my opinion.
>> >>>
>> >>> The Go SDK handles Strings as "custom coders" presently which for Go
>> are always length prefixed (and reported to the Runner as LP+CustomCoder).
>> It would be straight forward to add the correct handling for strings, as Go
>> natively treats strings as UTF8.
>> >>>
>> >>>
>> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee  wrote:
>> >>>>
>> >>>> Hi all,
>> >>>>
>> >>>> It looks like UTF-8 String Coder in Java and Python SDKs uses
>> different encoding schemes. StringUtf8Coder in Java SDK puts the varint
>> length of the input string before actual data bytes however StrUtf8Coder in
>> Python SDK directly encodes the input string to bytes value. For the last
>> few weeks, I've been testing and fixing cross-language IO transforms and
>> this discrepancy is a major blocker for me. IMO, we should unify the
>> encoding schemes of UTF8 strings across the different SDKs and make it a
>> standard coder. Any thoughts?
>> >>>>
>> >>>> Thanks,
>>
>


Re: How to use "PortableRunner" in Python SDK?

2019-01-22 Thread Heejong Lee
You can also try without the --streaming option. There's a separate streaming
wordcount example in the same directory.

If you want to look into the output files, it would be easier to use an
external target like gs:// instead of a local file.

python -m apache_beam.examples.wordcount --input=/etc/profile
--output=gs://tmp_location/py-wordcount --runner=PortableRunner
--job_endpoint=localhost:8099 --parallelism=1

On Tue, Jan 22, 2019 at 11:44 AM junwa...@gmail.com 
wrote:

> Hello,
>
> I tried to follow the instructions at
> https://beam.apache.org/roadmap/portability/#python-on-flink,
>
> 1. I installed Flink local cluster, and followed their
> SocketWindowWordCount example and confirmed  the cluster works properly.
>
> 2. Start Flink job server:
> ./gradlew :beam-runners-flink_2.11-job-server:runShadow
> -PflinkMasterUrl=localhost:8081
>
> 3. Subject the job as suggested by an earlier thread:
> python -m apache_beam.examples.wordcount --input=/etc/profile
> --output=/tmp/py-wordcount-direct --runner=PortableRunner
> --job_endpoint=localhost:8099 --parallelism=1
> --OPTIONALflink_master=localhost:8081 --streaming
>
> But got the following NullPointerException error (sorry for the long text
> below), any ideas? Thanks
>
> Jun Wan
>
>  log starts 
> [grpc-default-executor-2] INFO
> org.apache.beam.runners.flink.FlinkJobInvoker - Invoking job
> BeamApp-jwan-0121192804-387b3baa_1d32eea3-d71a-45a9-afa8-edbc66bc1d6b
> [grpc-default-executor-2] INFO
> org.apache.beam.runners.flink.FlinkJobInvocation - Starting job invocation
> BeamApp-jwan-0121192804-387b3baa_1d32eea3-d71a-45a9-afa8-edbc66bc1d6b
> [flink-runner-job-server] INFO
> org.apache.beam.runners.flink.FlinkJobInvocation - Translating pipeline to
> Flink program.
> [flink-runner-job-server] INFO
> org.apache.beam.runners.flink.FlinkExecutionEnvironments - Creating a
> Streaming Environment.
> [flink-runner-job-server] INFO
> org.apache.beam.runners.flink.FlinkExecutionEnvironments - Using Flink
> Master URL localhost:8081.
> [flink-runner-job-server] INFO
> org.apache.flink.api.java.typeutils.TypeExtractor - No fields were detected
> for class org.apache.beam.sdk.util.WindowedValue so it cannot be used as a
> POJO type and must be processed as GenericType. Please read the Flink
> documentation on "Data Types & Serialization" for details of the effect on
> performance.
> [the preceding TypeExtractor INFO message about WindowedValue repeats five
> more times]
> [flink-runner-job-server] INFO
> org.apache.flink.api.java.typeutils.TypeExtractor - class
> org.apache.beam.sdk.transforms.join.RawUnionValue does not contain a setter
> for field unionTag
> [flink-runner-job-server] INFO
> org.apache.flink.api.java.typeutils.TypeExtractor - Class class
> org.apache.beam.sdk.transforms.join.RawUnionValue cannot be used as a POJO
> type because not all fields are valid POJO fields, and must be processed as
> GenericType. Please read the Flink documentation on "Data Types &
> Serialization" for details of the effect on performance.
> [flink-runner-job-server] INFO
> org.apache.flink.api.java.typeutils.TypeExtractor - No fields were detected
> for class org.apache.beam.sdk.util.WindowedValue so it cannot be used as a
> POJO type and must be processed as GenericType. Please read the Flink
> documentation on "Data Types & Serialization" for details of the effect on
> performance.
> [remainder of the log truncated in the archive]

[PROPOSAL] decrease the number of threads for BigQuery streaming insertAll

2019-01-16 Thread Heejong Lee
Hi,

I'd like to suggest a change[1] to the thread pool type used by BigQuery
streaming inserts in the Java SDK (BEAM-6443). When we insert small records
into BigQuery at a high rate using BigQueryIO.write, it generates a large
number of rate-limit-exceeded errors in the log. This is mainly because the
number of threads used for the insert job is just too large (50 shards *
dozens of futures executed by an unlimited thread pool per bundle). I've
conducted some benchmarks[2] and saw that changing from an unlimited thread
pool to a single-thread pool reduces the number of (identical, repeated,
possibly meaningless) error messages by 1/4 while retaining the same
performance. I don't expect this change to hurt any important performance
measure, but if anybody has any concerns about it, please let me know.

Thanks,

[1] https://github.com/apache/beam/pull/7547
[2]
https://docs.google.com/document/d/1EhRNWLevm86GD_QtvlrTauHITVMwQBzuemyp-w4Z_ck/edit#heading=h.c0angyd9tn21
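
The change itself is to the Java SDK's executor, but the idea can be
sketched in a few lines of Python (illustrative only: hypothetical names,
with concurrent.futures standing in for Java's executor services):

from concurrent.futures import ThreadPoolExecutor, wait

def stream_insert_all(batches, insert_rpc):
    # Before: a thread pool that grows with demand lets every pending batch
    # issue its RPC at once; with 50 shards times dozens of futures per
    # bundle, the concurrent requests trip BigQuery's rate limits.
    #
    # After: a single-threaded pool serializes the RPCs within a bundle,
    # which per the benchmarks above keeps throughput roughly the same
    # while producing far fewer rate-limit errors.
    with ThreadPoolExecutor(max_workers=1) as pool:
        futures = [pool.submit(insert_rpc, batch) for batch in batches]
        wait(futures)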


Re: Add code quality checks to pre-commits.

2019-01-03 Thread Heejong Lee
>> [start of message truncated] ...ink
>> it allows to break builds and only does coverage. Am I correct?
>>
>> Regards,
>> --Mikhail
>>
>> Have feedback <http://go/migryz-feedback>?
>>
>> On Thu, Jan 3, 2019 at 2:18 PM Kenneth Knowles  wrote:
>>
>>> It would be very useful to have line and/or branch coverage visible.
>>> These are both very weak proxies for quality or reliability, so IMO strict
>>> thresholds are not helpful. One thing that is super useful is to integrate
>>> line coverage into code review, like this:
>>> https://docs.codecov.io/docs/browser-extension. It is very easy to
>>> notice major missing tests.
>>>
>>> We have never really used SonarQube. It was turned on as a possibility
>>> in the early days but never worked on past that point. Could be nice. I
>>> suspect there's a lot to be gained by just finding very low numbers and
>>> improving them. So just running Jacoco's offline HTML generation would do
>>> it (also this integrates with Jenkins). I tried this the other day and
>>> discovered that our gradle config is broken and does not wire tests and
>>> coverage reporting together properly. Last thing: How is "technical debt"
>>> measured? I'm skeptical of quantitative measures for qualitative notions.
>>>
>>> Kenn
>>>
>>> On Thu, Jan 3, 2019 at 1:58 PM Heejong Lee  wrote:
>>>
>>>> I don't have any experience of using SonarQube but Coverity worked well
>>>> for me. Looks like it already has beam repo:
>>>> https://scan.coverity.com/projects/11881
>>>>
>>>> On Thu, Jan 3, 2019 at 1:27 PM Reuven Lax  wrote:
>>>>
>>>>> checkstyle and findbugs are already run as precommit checks, are they
>>>>> not?
>>>>>
>>>>> On Thu, Jan 3, 2019 at 7:19 PM Mikhail Gryzykhin 
>>>>> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> In our current builds we (can) run multiple code quality check tools
>>>>>> like checkstyle, findbugs, and code test coverage via cobertura.
>>>>>> However, we do not utilize many of those signals.
>>>>>>
>>>>>> I suggest adding requirements to code based on those tools.
>>>>>> Specifically, I suggest adding pre-commit checks that will require PRs
>>>>>> to conform to some quality checks.
>>>>>>
>>>>>> We can see a good example of thresholds to add in Apache SonarQube's
>>>>>> default quality gate config
>>>>>> <https://builds.apache.org/analysis/quality_gates/show/1>:
>>>>>> 80% test coverage on new code,
>>>>>> at most 5% technical debt on new code,
>>>>>> no new bugs/vulnerabilities added.
>>>>>>
>>>>>> As another part of this proposal, I want to suggest using SonarQube
>>>>>> for tracking code statistics and as an agent for enforcing code
>>>>>> quality thresholds. It is an Apache-provided tool that integrates with
>>>>>> Jenkins and Gradle via plugins.
>>>>>>
>>>>>> I believe some reporting to SonarQube was configured for the mvn
>>>>>> builds of some Beam sub-projects, but was lost during the migration to
>>>>>> Gradle.
>>>>>>
>>>>>> I was looking for other options, but so far found only general Gradle
>>>>>> build configs that fail the build if project code coverage is too low.
>>>>>> Such an approach would force us to backfill tests for all existing
>>>>>> code, which can be tedious and requires learning legacy code that
>>>>>> might not be part of current work.
>>>>>>
>>>>>> I suggest we discuss and come to a conclusion on two points in this
>>>>>> thread:
>>>>>> 1. Do we want to add code quality checks to our pre-commit jobs and
>>>>>> require them to pass before a PR is merged?
>>>>>>
>>>>>> Suggested: Add code quality checks listed above at first, adjust them
>>>>>> as we see fit in the future.
>>>>>>
>>>>>> 2. What tools do we want to utilize for analyzing code quality?
>>>>>>
>>>>>> Under discussion. Suggested: SonarQube, but it will depend on the
>>>>>> level of functionality we want to achieve.
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> --Mikhail
>>>>>>
>>>>>
>
> --
> Got feedback? tinyurl.com/swegner-feedback
>


Re: Add code quality checks to pre-commits.

2019-01-03 Thread Heejong Lee
I don't have any experience of using SonarQube but Coverity worked well for
me. Looks like it already has beam repo:
https://scan.coverity.com/projects/11881

On Thu, Jan 3, 2019 at 1:27 PM Reuven Lax  wrote:

> checkstyle and findbugs are already run as precommit checks, are they not?
>
> On Thu, Jan 3, 2019 at 7:19 PM Mikhail Gryzykhin 
> wrote:
>
>> Hi everyone,
>>
>> In our current builds we (can) run multiple code quality check tools
>> like checkstyle, findbugs, and code test coverage via cobertura. However,
>> we do not utilize many of those signals.
>>
>> I suggest adding requirements to code based on those tools. Specifically,
>> I suggest adding pre-commit checks that will require PRs to conform to
>> some quality checks.
>>
>> We can see a good example of thresholds to add in Apache SonarQube's
>> default quality gate config
>> :
>> 80% test coverage on new code,
>> at most 5% technical debt on new code,
>> no new bugs/vulnerabilities added.
>>
>> As another part of this proposal, I want to suggest using SonarQube for
>> tracking code statistics and as an agent for enforcing code quality
>> thresholds. It is an Apache-provided tool that integrates with Jenkins
>> and Gradle via plugins.
>>
>> I believe some reporting to SonarQube was configured for the mvn builds
>> of some Beam sub-projects, but was lost during the migration to Gradle.
>>
>> I was looking for other options, but so far found only general Gradle
>> build configs that fail the build if project code coverage is too low.
>> Such an approach would force us to backfill tests for all existing code,
>> which can be tedious and requires learning legacy code that might not be
>> part of current work.
>>
>> I suggest we discuss and come to a conclusion on two points in this
>> thread:
>> 1. Do we want to add code quality checks to our pre-commit jobs and
>> require them to pass before a PR is merged?
>>
>> Suggested: Add code quality checks listed above at first, adjust them as
>> we see fit in the future.
>>
>> 2. What tools do we want to utilize for analyzing code quality?
>>
>> Under discussion. Suggested: SonarQube, but it will depend on the level
>> of functionality we want to achieve.
>>
>>
>> Regards,
>> --Mikhail
>>
>


Re: [PROPOSAL] ParquetIO support for Python SDK

2018-11-13 Thread Heejong Lee
In the current PR, there are two parameters that control the final row
group size: row_group_buffer_size and record_batch_size. Records are first
stored as a list of columns and then transformed into a record batch (a
data structure defined in pyarrow) when the number of records in the list
reaches record_batch_size. Record batches form another list that is written
out as a single row group when the byte size of the record batch list
exceeds row_group_buffer_size. row_group_buffer_size is normally much
bigger than the row group's encoded size in the Parquet file, so it is not
an exact estimate of the row group size as written, but I think this is the
best we can do given the limitations of the available Python Parquet
libraries. For a better estimate of the row group size in bytes, the
Parquet library would need to provide buffered writing of a row group and a
method returning the size of the encoded data in the write buffer. No
currently available Python Parquet library implements these features.
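
To make the scheme concrete, here is a minimal sketch written directly
against pyarrow (the two parameter names mirror the PR; the class itself
and its shape are invented for illustration and omit the final flush of any
leftover records):

import pyarrow as pa
import pyarrow.parquet as pq

class RowGroupBuffer(object):
    def __init__(self, schema, writer, record_batch_size=1000,
                 row_group_buffer_size=64 * 1024 * 1024):
        self._schema = schema
        self._writer = writer  # a pq.ParquetWriter for the output file
        self._record_batch_size = record_batch_size
        self._row_group_buffer_size = row_group_buffer_size
        self._columns = [[] for _ in schema.names]  # current batch, columnar
        self._batches = []        # completed record batches
        self._buffered_bytes = 0  # in-memory (pre-encoding) size estimate

    def append(self, record):
        # Accumulate the record column-wise.
        for i, name in enumerate(self._schema.names):
            self._columns[i].append(record[name])
        # Seal a record batch once record_batch_size records are buffered.
        if len(self._columns[0]) >= self._record_batch_size:
            batch = pa.RecordBatch.from_arrays(
                [pa.array(col, type=field.type)
                 for col, field in zip(self._columns, self._schema)],
                schema=self._schema)
            self._batches.append(batch)
            self._buffered_bytes += batch.nbytes
            self._columns = [[] for _ in self._schema.names]
        # Flush all buffered batches as one row group once the in-memory
        # size exceeds row_group_buffer_size; this is the pre-encoding size,
        # hence only an approximation of the on-disk row group size.
        if self._buffered_bytes >= self._row_group_buffer_size:
            self._writer.write_table(pa.Table.from_batches(self._batches))
            self._batches, self._buffered_bytes = [], 0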


On Tue, Nov 13, 2018 at 4:44 AM Robert Bradshaw  wrote:

> Was there a resolution on how to handle row group size, given that it's
> hard to pick a decent default? IIRC, the ideal was to base this on
> byte sizes; will this be in v1 or will there be other parameter(s)
> that we'll have to support going forward?
> On Tue, Oct 30, 2018 at 10:42 PM Heejong Lee  wrote:
> >
> > Thanks all for the valuable feedback on the document. Here's the summary
> of planned features for ParquetIO Python SDK:
> >
> > Can read from Parquet file on any storage system supported by Beam
> >
> > Can write to Parquet file on any storage system supported by Beam
> >
> > Can configure the compression algorithm of output files
> >
> > Can adjust the size of the row group
> >
> > Can read multiple row groups in a single file in parallel (source
> splitting)
> >
> > Can partially read by columns
> >
> >
> > It introduces a new dependency, pyarrow, for Parquet reading and writing
> > operations.
> >
> > If you're interested, you can review and test the PR
> https://github.com/apache/beam/pull/6763
> >
> > Thanks,
> >
> > On Wed, Oct 24, 2018 at 5:37 PM Chamikara Jayalath 
> wrote:
> >>
> >> Thanks Heejong. Added some comments. +1 for summarizing the doc in the
> email thread.
> >>
> >> - Cham
> >>
> >> On Wed, Oct 24, 2018 at 4:45 PM Ahmet Altay  wrote:
> >>>
> >>> Thank you Heejong. Could you also share a summary of the design
> document (major points/decisions) in the mailing list?
> >>>
> >>> On Wed, Oct 24, 2018 at 4:08 PM, Heejong Lee 
> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> I'm working on BEAM-: Parquet IO for Python SDK.
> >>>>
> >>>> Issue: https://issues.apache.org/jira/browse/BEAM-
> >>>> Design doc:
> https://docs.google.com/document/d/1-FT6zmjYhYFWXL8aDM5mNeiUnZdKnnB021zTo4S-0Wg
> >>>> WIP PR: https://github.com/apache/beam/pull/6763
> >>>>
> >>>> Any feedback is appreciated. Thanks!
> >>>>
> >>>
>


Re: [PROPOSAL] ParquetIO support for Python SDK

2018-10-30 Thread Heejong Lee
Thanks all for the valuable feedback on the document. Here's a summary of
the planned features for ParquetIO in the Python SDK:

   - Can read from Parquet files on any storage system supported by Beam
   - Can write to Parquet files on any storage system supported by Beam
   - Can configure the compression algorithm of output files
   - Can adjust the size of the row group
   - Can read multiple row groups in a single file in parallel (source
     splitting)
   - Can read only selected columns (column projection)


It introduces a new dependency, pyarrow, for Parquet reading and writing
operations.

If you're interested, you can review and test the PR
https://github.com/apache/beam/pull/6763
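
For a feel of the API, a short usage sketch (transform and parameter names
are taken from the WIP PR and may change before it is merged; the bucket
paths are placeholders):

import apache_beam as beam
import pyarrow
from apache_beam.io.parquetio import ReadFromParquet, WriteToParquet

with beam.Pipeline() as p:
    # Column projection: read only the 'name' and 'score' columns.
    records = p | ReadFromParquet('gs://my-bucket/input*.parquet',
                                  columns=['name', 'score'])
    records | WriteToParquet(
        'gs://my-bucket/output',
        pyarrow.schema([('name', pyarrow.string()),
                        ('score', pyarrow.int64())]),
        codec='snappy',                           # output compression
        row_group_buffer_size=64 * 1024 * 1024)   # row group sizing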

Thanks,

On Wed, Oct 24, 2018 at 5:37 PM Chamikara Jayalath 
wrote:

> Thanks Heejong. Added some comments. +1 for summarizing the doc in the
> email thread.
>
> - Cham
>
> On Wed, Oct 24, 2018 at 4:45 PM Ahmet Altay  wrote:
>
>> Thank you Heejong. Could you also share a summary of the design document
>> (major points/decisions) in the mailing list?
>>
>> On Wed, Oct 24, 2018 at 4:08 PM, Heejong Lee  wrote:
>>
>>> Hi,
>>>
>>> I'm working on BEAM-: Parquet IO for Python SDK.
>>>
>>> Issue: https://issues.apache.org/jira/browse/BEAM-
>>> Design doc:
>>> https://docs.google.com/document/d/1-FT6zmjYhYFWXL8aDM5mNeiUnZdKnnB021zTo4S-0Wg
>>> WIP PR: https://github.com/apache/beam/pull/6763
>>>
>>> Any feedback is appreciated. Thanks!
>>>
>>>
>>


[PROPOSAL] ParquetIO support for Python SDK

2018-10-24 Thread Heejong Lee
Hi,

I'm working on BEAM-: Parquet IO for Python SDK.

Issue: https://issues.apache.org/jira/browse/BEAM-
Design doc:
https://docs.google.com/document/d/1-FT6zmjYhYFWXL8aDM5mNeiUnZdKnnB021zTo4S-0Wg
WIP PR: https://github.com/apache/beam/pull/6763

Any feedback is appreciated. Thanks!


a new contributor

2018-10-19 Thread Heejong Lee
Hi,

I just wanted to introduce myself as a new contributor. I'm a new member of
the Apache Beam team at Google and will be working on IO modules. Happy to
meet you all!

Thanks,
Heejong