Re: Apache Beam Newsletter - February/March 2019

2019-03-11 Thread Pablo Estrada
I agree that the newsletter fits well as a blog post. I think that'd work
best.

As for the cadence, I think quarterly would be a bit too infrequent. I like
once a month, or once every other month to have at least one per release.
Though happy to hear other people's thoughts.
Best
-P.

On Mon, Mar 11, 2019 at 6:42 PM Thomas Weise  wrote:

> +1 for single blog/news section
>
> Also I wouldn't mind quarterly cadence, to provide more focus for folks to
> contribute.
>
> On Mon, Mar 11, 2019 at 6:35 PM Kenneth Knowles  wrote:
>
>> Could the newsletter be a blog entry? If you check out
>> https://blogs.apache.org/ many of the posts are "Apache News Round-up".
>> We could rename the "Blog" section to "News" if you ask me.
>>
>> Kenn
>>
>> On Mon, Mar 11, 2019 at 5:13 PM Aizhamal Nurmamat kyzy <
>> aizha...@google.com> wrote:
>>
>>> Hello everyone,
>>>
>>> I had a chat with Rose on how I can support the effort and keep sending
>>> the newsletters on a monthly basis. The new workflow would look as follows:
>>>
>>>1. Send out the same [CALL FOR ITEMS] where you can contribute to
>>>the Google doc with deadlines
>>>2. Edit the doc after the deadline
>>>3. Convert the file into Markdown
>>>4. Send in PR to add the file to Beam repo in Newsletter directory
>>>5. Have people send their fixes/updates through PRs
>>>
>>> In this effort, I can support Rose in steps 3 & 4.
>>>
>>> We would also need:
>>>
>>>- Create a Newsletter section under the Community tab
>>>- Write guidelines on newsletter contributions
>>>- Make a note about timing, e.g. if an event is upcoming, add it to
>>>the next newsletter
>>>
>>> How does that sound to you all?
>>>
>>>
>>> On Wed, Mar 6, 2019 at 9:19 PM Rose Nguyen  wrote:
>>>
 Good points. With the suggested workflow I think I can support a
 quarterly newsletter. I'm also happy to get more involvement from others to
 do this work and we can see what cadence that allows.

 On Wed, Mar 6, 2019 at 8:22 PM Kenneth Knowles  wrote:

> Good points Melissa & Austin.
>
>  - archive in the repo & on the website
>  - put missed items on the next newsletter, so anyone following sees
> them
>
> Kenn
>
> On Wed, Mar 6, 2019 at 3:26 PM Suneel Marthi 
> wrote:
>
>> I believe there was also a Beam workshop or working session in Warsaw
>> last week.
>>
>> On Wed, Mar 6, 2019 at 6:20 PM Austin Bennett <
>> whatwouldausti...@gmail.com> wrote:
>>
>>> +1 for archive in our repo.
>>>
>>> I do follow the newsletter, but am unlikely to go back and look into
>>> the past for changes/updates.
>>>
>>> Would suggest that things that get missed in one newsletter (a concrete
>>> example, Suneel's talks not mentioned in the newsletter) would get
>>> published in the next iteration, rather than editing the past 'published'
>>> newsletter. Put another way, save editing the past for corrections (typos,
>>> things being incorrect). Else, I imagine that I'm unlikely to catch a
>>> great announcement that warranted being in the newsletter in the first
>>> place. This certainly works better with a regular/frequent release
>>> cadence, like we arrived at for version releases (then, if something misses
>>> one cut, it is not too big a deal, as the next release is coming soon).
>>>
>>>
>>>
>>> On Wed, Mar 6, 2019 at 12:50 PM Melissa Pashniak <
>>> meliss...@google.com> wrote:
>>>

 For step #2 (publishing onto the website), I think it would be good
 to stay consistent with our existing workflows if possible. Rather than
 using an external tool, what about:

 After a google doc newsletter draft is ready, convert it into a
 standard markdown file and put it into our GitHub repo, perhaps in a new
 newsletter directory in the website community directory [1]. These would be
 listed for browsing on a Newsletters page as mentioned in step #4. People
 can then just open a PR to add missing things to the pages later, and the
 newsletter will be automatically updated on the website through our
 standard website workflow. It also avoids the potential issue of the source
 google docs disappearing in the future, as they are stored in a community
 location.

 [1]
 https://github.com/apache/beam/tree/master/website/src/community


 On Wed, Mar 6, 2019 at 10:36 AM Rose Nguyen 
 wrote:

> I think that would be a great idea to change formats to help with
> distribution. I'm open to suggestions! I'm currently using a Google doc to
> collect and edit, then copy/paste sending the newsletter out directly, based
> on an 

Re: [ANNOUNCE] New committer announcement: Raghu Angadi

2019-03-11 Thread Raghu Angadi
Thank you all!

On Mon, Mar 11, 2019 at 6:11 AM Maximilian Michels  wrote:

> Congrats! :)
>
> On 11.03.19 14:01, Etienne Chauchot wrote:
> > Congrats ! Well deserved
> >
> > Etienne
> >
> >> On Monday, 11 March 2019 at 13:22 +0100, Alexey Romanenko wrote:
> >> My congratulations, Raghu!
> >>
> >>> On 8 Mar 2019, at 10:39, Łukasz Gajowy wrote:
> >>>
> >>> Congratulations! :)
> >>>
> >>> On Fri, 8 Mar 2019 at 10:16, Gleb Kanterov wrote:
>  Congratulations!
> 
>  On Thu, Mar 7, 2019 at 11:52 PM Michael Luckey wrote:
> > Congrats Raghu!
> >
> > On Thu, Mar 7, 2019 at 8:06 PM Mark Liu wrote:
> >> Congrats!
> >>
> >> On Thu, Mar 7, 2019 at 10:45 AM Rui Wang wrote:
> >>> Congrats Raghu!
> >>>
> >>>
> >>> -Rui
> >>>
> >>> On Thu, Mar 7, 2019 at 10:22 AM Thomas Weise wrote:
>  Congrats!
> 
> 
>  On Thu, Mar 7, 2019 at 10:11 AM Tim Robertson wrote:
> > Congrats Raghu
> >
> > On Thu, Mar 7, 2019 at 7:09 PM Ahmet Altay wrote:
> >> Congratulations!
> >>
> >> On Thu, Mar 7, 2019 at 10:08 AM Ruoyun Huang wrote:
> >>> Thank you Raghu for your contribution!
> >>>
> >>>
> >>>
> >>> On Thu, Mar 7, 2019 at 9:58 AM Connell O'Callaghan wrote:
>  Congratulations Raghu!!! Thank you for sharing, Kenn!!!
> 
>  On Thu, Mar 7, 2019 at 9:55 AM Ismaël Mejía wrote:
> > Congrats !
> >
> > On Thu, 7 Mar 2019 at 17:09, Aizhamal Nurmamat kyzy wrote:
> >> Congratulations, Raghu!!!
> >> On Thu, Mar 7, 2019 at 08:07 Kenneth Knowles wrote:
> >>> Hi all,
> >>>
> >>> Please join me and the rest of the Beam PMC in welcoming
> >>> a new committer: Raghu Angadi
> >>>
> >>> Raghu has been contributing to Beam since early 2016! He
> >>> has continuously improved KafkaIO and supported users on the
> >>> user@ list, but his community contributions are even more
> >>> extensive, including reviews, dev@ list discussions,
> >>> improvements and ideas across SqsIO, FileIO, PubsubIO,
> >>> and the Dataflow and Samza runners. In consideration of
> >>> Raghu's contributions, the Beam PMC trusts Raghu with the
> >>> responsibilities of a Beam committer [1].
> >>>
> >>> Thank you, Raghu, for your contributions.
> >>>
> >>> Kenn
> >>>
> >>> [1]
> >>>
> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
> >>>
> >>>
> >>> --
> >>> 
> >>> Ruoyun  Huang
> >>>
> 
> 
>  --
>  Cheers,
>  Gleb
> >>
>


Re: Apache Beam Newsletter - February/March 2019

2019-03-11 Thread Thomas Weise
+1 for single blog/news section

Also I wouldn't mind quarterly cadence, to provide more focus for folks to
contribute.

On Mon, Mar 11, 2019 at 6:35 PM Kenneth Knowles  wrote:

> Could the newsletter be a blog entry? If you check out
> https://blogs.apache.org/ many of the posts are "Apache News Round-up".
> We could rename the "Blog" section to "News" if you ask me.
>
> Kenn
>
> On Mon, Mar 11, 2019 at 5:13 PM Aizhamal Nurmamat kyzy <
> aizha...@google.com> wrote:
>
>> Hello everyone,
>>
>> I had a chat with Rose on how I can support the effort and keep sending
>> the newsletters on a monthly basis. The new workflow would look as follows:
>>
>>1. Send out the same [CALL FOR ITEMS] where you can contribute to the
>>Google doc with deadlines
>>2. Edit the doc after the deadline
>>3. Convert the file into Markdown
>>4. Send in PR to add the file to Beam repo in Newsletter directory
>>5. Have people send their fixes/updates through PRs
>>
>> In this effort, I can support Rose in steps 3 & 4.
>>
>> We would also need:
>>
>>- Create a Newsletter section under the Community tab
>>- Write guidelines on newsletter contributions
>>- Make a note about timing, e.g. if an event is upcoming, add it to the
>>next newsletter
>>
>> How does that sound to you all?
>>
>>
>> On Wed, Mar 6, 2019 at 9:19 PM Rose Nguyen  wrote:
>>
>>> Good points. With the suggested workflow I think I can support a
>>> quarterly newsletter. I'm also happy to get more involvement from others to
>>> do this work and we can see what cadence that allows.
>>>
>>> On Wed, Mar 6, 2019 at 8:22 PM Kenneth Knowles  wrote:
>>>
 Good points Melissa & Austin.

  - archive in the repo & on the website
  - put missed items on the next newsletter, so anyone following sees
 them

 Kenn

 On Wed, Mar 6, 2019 at 3:26 PM Suneel Marthi 
 wrote:

> I believe there was also a Beam workshop or working session in Warsaw
> last week.
>
> On Wed, Mar 6, 2019 at 6:20 PM Austin Bennett <
> whatwouldausti...@gmail.com> wrote:
>
>> +1 for archive in our repo.
>>
>> I do follow the newsletter, but am unlikely to go back and look into
>> the past for changes/updates.
>>
>> Would suggest that things that get missed in one newsletter (a concrete
>> example, Suneel's talks not mentioned in the newsletter) would get
>> published in the next iteration, rather than editing the past 'published'
>> newsletter. Put another way, save editing the past for corrections (typos,
>> things being incorrect). Else, I imagine that I'm unlikely to catch a
>> great announcement that warranted being in the newsletter in the first
>> place. This certainly works better with a regular/frequent release
>> cadence, like we arrived at for version releases (then, if something misses
>> one cut, it is not too big a deal, as the next release is coming soon).
>>
>>
>>
>>
>> On Wed, Mar 6, 2019 at 12:50 PM Melissa Pashniak <
>> meliss...@google.com> wrote:
>>
>>>
>>> For step #2 (publishing onto the website), I think it would be good
>>> to stay consistent with our existing workflows if possible. Rather than
>>> using an external tool, what about:
>>>
>>> After a google doc newsletter draft is ready, convert it into a
>>> standard markdown file and put it into our GitHub repo, perhaps in a new
>>> newsletter directory in the website community directory [1]. These would be
>>> listed for browsing on a Newsletters page as mentioned in step #4. People
>>> can then just open a PR to add missing things to the pages later, and the
>>> newsletter will be automatically updated on the website through our
>>> standard website workflow. It also avoids the potential issue of the source
>>> google docs disappearing in the future, as they are stored in a community
>>> location.
>>>
>>> [1] https://github.com/apache/beam/tree/master/website/src/community
>>>
>>>
>>> On Wed, Mar 6, 2019 at 10:36 AM Rose Nguyen 
>>> wrote:
>>>
 I think that would be a great idea to change formats to help with
 distribution. I'm open to suggestions! I'm currently using a Google doc to
 collect and edit, then copy/paste sending the newsletter out directly, based
 on an interpretation of this discussion.

 How about this doc->website->Beam site workflow?:

1. The same usual newsletter [CALL FOR ITEMS] where you can
contribute to the google doc, with soft deadlines for when I'll publish.
2. I'll publish the doc itself onto a website.
3. 

Re: Apache Beam Newsletter - February/March 2019

2019-03-11 Thread Kenneth Knowles
Could the newsletter be a blog entry? If you check out
https://blogs.apache.org/ many of the posts are "Apache News Round-up". We
could rename the "Blog" section to "News" if you ask me.

Kenn

On Mon, Mar 11, 2019 at 5:13 PM Aizhamal Nurmamat kyzy 
wrote:

> Hello everyone,
>
> I had a chat with Rose on how I can support the effort and keep sending
> the newsletters on a monthly basis. The new workflow would look as follows:
>
>1. Send out the same [CALL FOR ITEMS] where you can contribute to the
>Google doc with deadlines
>2. Edit the doc after the deadline
>3. Convert the file into Markdown
>4. Send in PR to add the file to Beam repo in Newsletter directory
>5. Have people send their fixes/updates through PRs
>
> In this effort, I can support Rose in steps 3 & 4.
>
> We would also need:
>
>- Create a Newsletter section under the Community tab
>- Write guidelines on newsletter contributions
>- Make a note about timing, e.g. if an event is upcoming, add it to the
>next newsletter
>
> How does that sound to you all?
>
>
> On Wed, Mar 6, 2019 at 9:19 PM Rose Nguyen  wrote:
>
>> Good points. With the suggested workflow I think I can support a
>> quarterly newsletter. I'm also happy to get more involvement from others to
>> do this work and we can see what cadence that allows.
>>
>> On Wed, Mar 6, 2019 at 8:22 PM Kenneth Knowles  wrote:
>>
>>> Good points Melissa & Austin.
>>>
>>>  - archive in the repo & on the website
>>>  - put missed items on the next newsletter, so anyone following sees them
>>>
>>> Kenn
>>>
>>> On Wed, Mar 6, 2019 at 3:26 PM Suneel Marthi  wrote:
>>>
 I believe there was also a Beam workshop or working session in Warsaw
 last week.

 On Wed, Mar 6, 2019 at 6:20 PM Austin Bennett <
 whatwouldausti...@gmail.com> wrote:

> +1 for archive in our repo.
>
> I do follow the newsletter, but am unlikely to go back and look into
> the past for changes/updates.
>
> Would suggest that things that get missed in one newsletter (a concrete
> example, Suneel's talks not mentioned in the newsletter) would get
> published in the next iteration, rather than editing the past 'published'
> newsletter. Put another way, save editing the past for corrections (typos,
> things being incorrect). Else, I imagine that I'm unlikely to catch a
> great announcement that warranted being in the newsletter in the first
> place. This certainly works better with a regular/frequent release
> cadence, like we arrived at for version releases (then, if something misses
> one cut, it is not too big a deal, as the next release is coming soon).
>
>
>
>
> On Wed, Mar 6, 2019 at 12:50 PM Melissa Pashniak 
> wrote:
>
>>
>> For step #2 (publishing onto the website), I think it would be good
>> to stay consistent with our existing workflows if possible. Rather than
>> using an external tool, what about:
>>
>> After a google doc newsletter draft is ready, convert it into a
>> standard markdown file and put it into our GitHub repo, perhaps in a new
>> newsletter directory in the website community directory [1]. These would be
>> listed for browsing on a Newsletters page as mentioned in step #4. People
>> can then just open a PR to add missing things to the pages later, and the
>> newsletter will be automatically updated on the website through our
>> standard website workflow. It also avoids the potential issue of the source
>> google docs disappearing in the future, as they are stored in a community
>> location.
>>
>> [1] https://github.com/apache/beam/tree/master/website/src/community
>>
>>
>> On Wed, Mar 6, 2019 at 10:36 AM Rose Nguyen 
>> wrote:
>>
>>> I think that would be a great idea to change formats to help with
>>> distribution. I'm open to suggestions! I'm currently using a Google doc to
>>> collect and edit, then copy/paste sending the newsletter out directly, based
>>> on an interpretation of this discussion.
>>>
>>> How about this doc->website->Beam site workflow?:
>>>
>>>1. The same usual newsletter [CALL FOR ITEMS] where you can
>>>contribute to the google doc, with soft deadlines for when I'll publish.
>>>2. I'll publish the doc itself onto a website.
>>>3. The newsletter is mailed out in the same way, but now with a
>>>shareable website link.
>>>4. We'll keep an index of archived newsletter web pages on the
>>>Beam site, under the Community tab.
>>>5. If you want to submit more content after the soft deadline,
>>>add it to the google doc and let me know to republish. I don't want to make

Re: Apache Beam Newsletter - February/March 2019

2019-03-11 Thread Aizhamal Nurmamat kyzy
Hello everyone,

I had a chat with Rose on how I can support the effort and keep sending the
newsletters on a monthly basis. The new workflow would look as follows:

   1. Send out the same [CALL FOR ITEMS] where you can contribute to the
   Google doc with deadlines
   2. Edit the doc after the deadline
   3. Convert the file into Markdown
   4. Send in PR to add the file to Beam repo in Newsletter directory
   5. Have people send their fixes/updates through PRs

In this effort, I can support Rose in steps 3 & 4.

We would also need:

   - Create a Newsletter section under the Community tab
   - Write guidelines on newsletter contributions
   - Make a note about timing, e.g. if an event is upcoming, add it to the
   next newsletter

How does that sound to you all?
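Step 3 of the workflow above (doc to Markdown) is the mechanical part. As a rough illustration only, here is a minimal sketch of what such a conversion could look like, assuming the doc is first exported as simple HTML; the element handling and sample content are made up, and a real conversion would more likely use an off-the-shelf tool such as pandoc:

```python
from html.parser import HTMLParser

class DocToMarkdown(HTMLParser):
    """Very rough HTML-to-Markdown converter for a simple exported doc."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # accumulated markdown blocks
        self.prefix = ""   # markdown prefix for the next text node

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.prefix = "#" * int(tag[1]) + " "   # heading level -> '#'s
        elif tag == "li":
            self.prefix = "- "                      # list item -> bullet

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.blocks.append(self.prefix + text)
            self.prefix = ""

    def markdown(self):
        return "\n\n".join(self.blocks)

parser = DocToMarkdown()
parser.feed("<h1>Beam Newsletter</h1><p>Highlights from the community.</p>"
            "<ul><li>KafkaIO improvements</li><li>New committers</li></ul>")
print(parser.markdown())
```

The output of step 3 would then be committed in step 4 as a regular website PR.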


On Wed, Mar 6, 2019 at 9:19 PM Rose Nguyen  wrote:

> Good points. With the suggested workflow I think I can support a quarterly
> newsletter. I'm also happy to get more involvement from others to do this
> work and we can see what cadence that allows.
>
> On Wed, Mar 6, 2019 at 8:22 PM Kenneth Knowles  wrote:
>
>> Good points Melissa & Austin.
>>
>>  - archive in the repo & on the website
>>  - put missed items on the next newsletter, so anyone following sees them
>>
>> Kenn
>>
>> On Wed, Mar 6, 2019 at 3:26 PM Suneel Marthi  wrote:
>>
>>> I believe there was also a Beam workshop or working session in Warsaw
>>> last week.
>>>
>>> On Wed, Mar 6, 2019 at 6:20 PM Austin Bennett <
>>> whatwouldausti...@gmail.com> wrote:
>>>
 +1 for archive in our repo.

 I do follow the newsletter, but am unlikely to go back and look into
 the past for changes/updates.

 Would suggest that things that get missed in one newsletter (a concrete
 example, Suneel's talks not mentioned in the newsletter) would get
 published in the next iteration, rather than editing the past 'published'
 newsletter.  Put another way, save editing the past for corrections (typos,
 things being incorrect).  Else, I imagine that I'm unlikely to catch a
 great announcement that warranted being in the newsletter in the first
 place.  This certainly works better with a regular/frequent release
 cadence, like we arrived at for version releases (then, if something misses
 one cut, it is not too big a deal, as the next release is coming soon).




 On Wed, Mar 6, 2019 at 12:50 PM Melissa Pashniak 
 wrote:

>
> For step #2 (publishing onto the website), I think it would be good to
> stay consistent with our existing workflows if possible. Rather than using
> an external tool, what about:
>
> After a google doc newsletter draft is ready, convert it into a
> standard markdown file and put it into our GitHub repo, perhaps in a new
> newsletter directory in the website community directory [1]. These would be
> listed for browsing on a Newsletters page as mentioned in step #4. People
> can then just open a PR to add missing things to the pages later, and the
> newsletter will be automatically updated on the website through our
> standard website workflow. It also avoids the potential issue of the source
> google docs disappearing in the future, as they are stored in a community
> location.
>
> [1] https://github.com/apache/beam/tree/master/website/src/community
>
>
> On Wed, Mar 6, 2019 at 10:36 AM Rose Nguyen 
> wrote:
>
>> I think that would be a great idea to change formats to help with
>> distribution. I'm open to suggestions! I'm currently using a Google doc to
>> collect and edit, then copy/paste sending the newsletter out directly, based
>> on an interpretation of this discussion.
>>
>> How about this doc->website->Beam site workflow?:
>>
>>1. The same usual newsletter [CALL FOR ITEMS] where you can
>>contribute to the google doc, with soft deadlines for when I'll publish.
>>2. I'll publish the doc itself onto a website.
>>3. The newsletter is mailed out in the same way, but now with a
>>shareable website link.
>>4. We'll keep an index of archived newsletter web pages on the
>>Beam site, under the Community tab.
>>5. If you want to submit more content after the soft deadline,
>>add it to the google doc and let me know to republish. I don't want to make
>>the publication changes automatic because that leaves us open to tampering.
>>
>>
>> This process is more laggy, so I'd suggest doing a 2 month vs monthly
>> newsletter cadence. If we're happy with this idea, I'll send in a website
>> PR for a new "Newsletter" left nav item under Community.
>>
>> Here's an example of a published newsletter: Apache Beam
>> 

JIRA hygiene

2019-03-11 Thread Thomas Weise
JIRA probably deserves a separate discussion. It is messy. We also have
examples of tickets referenced by users that were never closed, although
the feature was long since implemented or the issue fixed.

There is no clear ownership in our workflow.

A while ago, in another context, I proposed making JIRA resolution part of
committer duty. I would like to bring this up for discussion again:

https://github.com/apache/beam/pull/7129#discussion_r236405202

Thomas


On Mon, Mar 11, 2019 at 1:47 PM Ahmet Altay  wrote:

> I agree this is a good idea. I used the same technique for the 2.11 blog post
> (JIRA release notes -> editorialized list + diffed the dependencies).
>
> On Mon, Mar 11, 2019 at 1:40 PM Kenneth Knowles  wrote:
>
>> That is a good idea. The blog post is probably the main avenue where
>> folks will find out about new features or bug fixes.
>>
>> When I did 2.10.0 I just used the automated Jira release notes and pulled
>> out significant things based on my judgment. I would also suggest that our
>> Jira hygiene could be significantly improved to make this process more
>> effective.
>>
>
> +1 to improving JIRA notes as well. Oftentimes issues are closed with no
> real comment on what happened or whether it was resolved. It becomes an
> exercise in reading the linked PRs to figure out what happened.
>
>
>>
>> Kenn
>>
>> On Mon, Mar 11, 2019 at 1:04 PM Thomas Weise  wrote:
>>
>>> Ahmet, thanks for managing the release!
>>>
>>> I have a suggestion (not specific to only this release):
>>>
>>> The release blogs could be more useful to users. In this case, we have a
>>> long list of dependency updates at the top, but the improvements and
>>> features section should probably come first. I was also very surprised to find
>>> "Portable Flink runner support for running cross-language transforms."
>>> mentioned, since that is only being worked on now. On the other hand, there
>>> are probably items that we miss.
>>>
>>> Since this can only be addressed by more eyes, I suggest that going
>>> forward the blog pull request is included and reviewed as part of the
>>> release vote.
>>>
>>> Also, we should make announcing the release on Twitter part of the
>>> process.
>>>
>>
> This is actually part of the release process (
> https://beam.apache.org/contribute/release-guide/#social-media). I missed
> it for 2.11. I will send an announcement on Twitter shortly.
>
>
>>
>>> Thanks,
>>> Thomas
>>>
>>>
>>> On Mon, Mar 11, 2019 at 10:46 AM Ahmet Altay  wrote:
>>>
 I updated the JIRAs for these two PRs to set the fix version correctly
 as 2.12.0. That should fix the release notes issue.

 On Mon, Mar 11, 2019 at 10:44 AM Ahmet Altay  wrote:

> Hi Etienne,
>
> I cut the release branch on 2/14 at [1] (on Feb 14, 2019, 3:52 PM PST
> -- github timestamp). Release tag, as you pointed out, points to a commit
> on Feb 25, 2019 11:48 PM PST. And that is a commit on the release branch.
>
> After cutting the release branch, I only merged cherry picks from
> master to the release branch if a JIRA was tagged as a release blocker and
> there was a PR to fix that specific issue. In case of these two PRs, they
> were merged at Feb 20 and Feb 18 respectively. They were not included in
> the branch cut and I did not cherry pick them either. I apologize if I
> missed a request to cherry pick these PRs.
>
> Does this answer your question?
>
> Ahmet
>
> [1]
> https://github.com/apache/beam/commit/a103edafba569b2fd185b79adffd91aaacb790f0
>
> On Mon, Mar 11, 2019 at 1:50 AM Etienne Chauchot 
> wrote:
>
>> @Ahmet, sorry, I did not have time to check the 2.11 release, but a fellow
>> Beam contributor drew my attention to something:
>>
>> the 2.11 release tag points to a commit of 02/26, this [1] PR was
>> merged on 02/20, and that [2] PR was merged on 02/18. So both commits should
>> be in the released code, but they are not.
>>
>> [1] https://github.com/apache/beam/pull/7348
>> [2] https://github.com/apache/beam/pull/7751
>>
>> So at least for those 2 features the release notes do not match the
>> content of the release. Is there a real problem, or did I miss something?
>>
>> Etienne
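Whether a given merge commit actually landed in a release tag can be checked mechanically with `git merge-base --is-ancestor`. A self-contained sketch in a throwaway repository (the repo, commits, and `v1.0` tag below are illustrative; for the case above one would run the same check against the real release tag and the PRs' merge commits):

```python
import subprocess
import tempfile

def git(*args, cwd):
    # Run a git command, discarding output; raise on failure.
    subprocess.run(["git", *args], cwd=cwd, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

def contained_in(commit, tag, cwd):
    """True if `commit` is an ancestor of (i.e. contained in) `tag`."""
    result = subprocess.run(
        ["git", "merge-base", "--is-ancestor", commit, tag], cwd=cwd)
    return result.returncode == 0

repo = tempfile.mkdtemp()
ident = ["-c", "user.name=t", "-c", "user.email=t@example.com"]
git("init", "-q", cwd=repo)
git(*ident, "commit", "-q", "--allow-empty", "-m", "before branch cut", cwd=repo)
git("tag", "v1.0", cwd=repo)
git(*ident, "commit", "-q", "--allow-empty", "-m", "merged after the cut", cwd=repo)

# The commit made after tagging is not part of the v1.0 release.
print(contained_in("HEAD", "v1.0", repo))  # False
print(contained_in("v1.0", "HEAD", repo))  # True
```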
>>
>> On Monday, 4 March 2019 at 11:42 -0800, Ahmet Altay wrote:
>>
>> Thank you for the additional votes and validations.
>>
>> Update: Binaries are pushed. Website updates are blocked on an issue
>> that is preventing beam-site changes from being synced to the Beam website.
>> (INFRA-17953). I am waiting for that to be resolved before sending an
>> announcement.
>>
>> On Mon, Mar 4, 2019 at 3:00 AM Robert Bradshaw 
>> wrote:
>>
>> I see the vote has passed, but +1 (binding) from me as well.
>>
>> On Mon, Mar 4, 2019 at 11:51 AM Jean-Baptiste Onofré 
>> wrote:
>> >
>> > +1 (binding)
>> >
>> > Tested with beam-samples.
>> >
>> > Regards

Re: New contributor to Beam

2019-03-11 Thread Kenneth Knowles
Welcome!

On Mon, Mar 11, 2019 at 12:22 PM Melissa Pashniak 
wrote:

>
> Welcome!
>
>
> On Mon, Mar 11, 2019 at 12:16 PM Suneel Marthi 
> wrote:
>
>> Welcome Aizhamal
>>
>> Sent from my iPhone
>>
>> On Mar 11, 2019, at 2:08 PM, Rose Nguyen  wrote:
>>
>> Welcome, Aizhamal!
>>
>> On Mon, Mar 11, 2019 at 10:55 AM Ahmet Altay  wrote:
>>
>>> Welcome!
>>>
>>> On Fri, Mar 8, 2019 at 3:25 PM Ismaël Mejía  wrote:
>>>
 Done, welcome !

 On Fri, Mar 8, 2019 at 11:03 PM Aizhamal Nurmamat kyzy
  wrote:
 >
 > Hello everyone!
 >
 > My name is Aizhamal and I would like to start contributing to Beam.
 Can anyone add me as a contributor for Beam's Jira issue tracker? I would
 like to create and assign tickets.
 >
 > My jira username is aizhamal.
 >
 > Thanks and excited to be part of this community!
 > Aizhamal

>>>
>>
>> --
>> Rose Thị Nguyễn
>>
>>


Re: [VOTE] Release 2.11.0, release candidate #2

2019-03-11 Thread Ahmet Altay
I agree this is a good idea. I used the same technique for the 2.11 blog post
(JIRA release notes -> editorialized list + diffed the dependencies).
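The "diffed the dependencies" step lends itself to a small script. A minimal sketch comparing two hypothetical {artifact: version} snapshots (the artifact names and versions below are made up for illustration, not Beam's actual dependency sets):

```python
def diff_dependencies(old, new):
    """Compare two {artifact: version} maps; return (added, removed, upgraded)."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    upgraded = {k: (old[k], new[k]) for k in old.keys() & new.keys()
                if old[k] != new[k]}
    return added, removed, upgraded

# Hypothetical dependency snapshots for two releases (illustrative only).
prev_release = {"guava": "20.0", "jackson-databind": "2.9.5", "netty": "4.1.17"}
next_release = {"guava": "20.0", "jackson-databind": "2.9.8", "protobuf-java": "3.6.0"}

added, removed, upgraded = diff_dependencies(prev_release, next_release)
print("added:", added)
print("removed:", removed)
print("upgraded:", upgraded)
```

The editorialized blog section can then be written from these three buckets instead of eyeballing the raw release notes.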

On Mon, Mar 11, 2019 at 1:40 PM Kenneth Knowles  wrote:

> That is a good idea. The blog post is probably the main avenue where folks
> will find out about new features or bug fixes.
>
> When I did 2.10.0 I just used the automated Jira release notes and pulled
> out significant things based on my judgment. I would also suggest that our
> Jira hygiene could be significantly improved to make this process more
> effective.
>

+1 to improving JIRA notes as well. Oftentimes issues are closed with no
real comment on what happened or whether it was resolved. It becomes an
exercise in reading the linked PRs to figure out what happened.


>
> Kenn
>
> On Mon, Mar 11, 2019 at 1:04 PM Thomas Weise  wrote:
>
>> Ahmet, thanks for managing the release!
>>
>> I have a suggestion (not specific to only this release):
>>
>> The release blogs could be more useful to users. In this case, we have a
>> long list of dependency updates at the top, but the improvements and
>> features section should probably come first. I was also very surprised to find
>> "Portable Flink runner support for running cross-language transforms."
>> mentioned, since that is only being worked on now. On the other hand, there
>> are probably items that we miss.
>>
>> Since this can only be addressed by more eyes, I suggest that going
>> forward the blog pull request is included and reviewed as part of the
>> release vote.
>>
>> Also, we should make announcing the release on Twitter part of the
>> process.
>>
>
This is actually part of the release process (
https://beam.apache.org/contribute/release-guide/#social-media). I missed
it for 2.11. I will send an announcement on Twitter shortly.


>
>> Thanks,
>> Thomas
>>
>>
>> On Mon, Mar 11, 2019 at 10:46 AM Ahmet Altay  wrote:
>>
>>> I updated the JIRAs for these two PRs to set the fix version correctly
>>> as 2.12.0. That should fix the release notes issue.
>>>
>>> On Mon, Mar 11, 2019 at 10:44 AM Ahmet Altay  wrote:
>>>
 Hi Etienne,

 I cut the release branch on 2/14 at [1] (on Feb 14, 2019, 3:52 PM PST
 -- github timestamp). Release tag, as you pointed out, points to a commit
 on Feb 25, 2019 11:48 PM PST. And that is a commit on the release branch.

 After cutting the release branch, I only merged cherry picks from
 master to the release branch if a JIRA was tagged as a release blocker and
 there was a PR to fix that specific issue. In case of these two PRs, they
 were merged at Feb 20 and Feb 18 respectively. They were not included in
 the branch cut and I did not cherry pick them either. I apologize if I
 missed a request to cherry pick these PRs.

 Does this answer your question?

 Ahmet

 [1]
 https://github.com/apache/beam/commit/a103edafba569b2fd185b79adffd91aaacb790f0

 On Mon, Mar 11, 2019 at 1:50 AM Etienne Chauchot 
 wrote:

> @Ahmet, sorry, I did not have time to check the 2.11 release, but a fellow
> Beam contributor drew my attention to something:
>
> the 2.11 release tag points to a commit of 02/26, this [1] PR was
> merged on 02/20, and that [2] PR was merged on 02/18. So both commits should
> be in the released code, but they are not.
>
> [1] https://github.com/apache/beam/pull/7348
> [2] https://github.com/apache/beam/pull/7751
>
> So at least for those 2 features the release notes do not match the
> content of the release. Is there a real problem, or did I miss something?
>
> Etienne
>
> On Monday, 4 March 2019 at 11:42 -0800, Ahmet Altay wrote:
>
> Thank you for the additional votes and validations.
>
> Update: Binaries are pushed. Website updates are blocked on an issue
> that is preventing beam-site changes from being synced to the Beam website.
> (INFRA-17953). I am waiting for that to be resolved before sending an
> announcement.
>
> On Mon, Mar 4, 2019 at 3:00 AM Robert Bradshaw 
> wrote:
>
> I see the vote has passed, but +1 (binding) from me as well.
>
> On Mon, Mar 4, 2019 at 11:51 AM Jean-Baptiste Onofré 
> wrote:
> >
> > +1 (binding)
> >
> > Tested with beam-samples.
> >
> > Regards
> > JB
> >
> > On 26/02/2019 10:40, Ahmet Altay wrote:
> > > Hi everyone,
> > >
> > > Please review and vote on the release candidate #2 for the version
> > > 2.11.0, as follows:
> > >
> > > [ ] +1, Approve the release
> > > [ ] -1, Do not approve the release (please provide specific
> comments)
> > >
> > > The complete staging area is available for your review, which
> includes:
> > > * JIRA release notes [1],
> > > * the official Apache source release to be deployed to
> dist.apache.org
> > >  [2], which is signed with the key with
> > 

Re: Executing gradlew build

2019-03-11 Thread Ahmet Altay
On Mon, Mar 11, 2019 at 7:03 AM Michael Luckey  wrote:

>
>
> On Mon, Mar 11, 2019 at 3:51 AM Kenneth Knowles  wrote:
>
>> I have never actually tried to run a full build recently. It takes a long
>> time and usually isn't what is needed for a particular change. FWIW I view
>> Beam at this point as a mini-monorepo, so each directory and target can be
>> healthy/unhealthy on its own.
>>
>
> Fair Point. Totally agree.
>
>
>>
>> But it does seem like we should at least know what is unhealthy and why.
>> Have you been filing Jiras about the failures? Are they predictable? Are
>> they targets that pass in Jenkins but not in vanilla build? That would mean
>> our Jenkins environment is too rich and should be slimmed down probably.
>>
>> Kenn
>>
>
> Unfortunately those failures are not really predictable. Usually, I start
> with plain './gradlew build' and keep adding some '-x
> :beam-sdks-python:testPy2Gcp -x :beam-sdks-python:testPython' until the build
> succeeds. Afterwards it seems to be possible to remove these exclusions step
> by step, thereby filling the build cache, which on the next reruns might have
> some impact on how tasks are executed.
>
> Most of the failures are Python related. I have not had much success getting
> into those. From time to time I see 'seemingly' similar failures on Jenkins,
> but tracing on Python is more difficult coming from a mostly Java background.
> Also, using a Mac, I seem to remember that the preinstalled Python had some
> issues/differences compared with private installs. Others are those
> Elasticsearch tests - which were worked on lately - and ClickHouse tests,
> which still seem to be flaky.
>

> So I mainly blamed my setup and did not yet have the time to further track
> those failures down. But
>
> As I used a vanilla system and was not able to get Beam to build, I got
> thinking about
>
> 1. The release process
> The release manager has a lot of stuff to take care of, but is also
> supposed to run a full Gradle build on her local machine [1]. Apart from
> that being a long-running task, if it keeps failing this puts an additional
> burden on the release manager. So I was wondering why we cannot push that
> to Jenkins as we do with all these other tests [2]. Here I did not find any
> existing job doing so, so I wanted to ask for feedback here.
>
> If a full build has to be run - and of course it makes some sense for a
> release - I would suggest getting it running on a regular basis on Jenkins,
> just to ensure we are not surprised during a release. As a side effect,
> this would enable the release manager to also delegate it to Jenkins to
> free her time (and dev box).
>

+1, this will be great. Quite often we end up catching issues right when we
are doing the release. I would one-up this request and suggest a Jenkins
job running as much of the release process as possible to avoid last-minute
surprises.

Also, I wonder if we could build our releases in a controlled environment
(e.g. a container). That way the release would be more reproducible and the
release manager would have to spend less time setting up their environment
(e.g. the environment for building Python in your case).


>
> Until now I have not yet had the time to investigate whether all these
> pre/post/other jobs cover all our tests on all modules. I hoped someone else
> could point to the list of jobs which cover all tests.
>
> 2. Those newcomers
> As a naive newcomer to a project, I usually deal with those jars. After
> problems arise, I might download the corresponding sources and try to
> investigate. The first thing I usually do here is a make/build/install to get
> those private builds in. Also, on our contributor site [3] we recommend
> ensuring you are able to run all tests with 'gradlew check', which is not too
> far away from a full build. Probably no one would expect a full build to fail
> here, which again makes me think we need an equivalent job on Jenkins here?
>
> Of course it is also fully reasonable not to support this full build, in
> line with that mini-monorepo view.
>
> On the other hand, after filling the build cache - and enabling more tasks
> for caching - a full build should not be too expensive in general. Although
> I'd prefer some more granularity, e.g. 'gradlew -p sdks/java build' to
> build all of Java, which, if I understand correctly, would be a side effect
> of fixing [4].
>
> [1]
> https://github.com/apache/beam/blob/master/release/src/main/scripts/verify_release_build.sh#L142
> [2]
> https://github.com/apache/beam/blob/master/release/src/main/scripts/verify_release_build.sh#L168-L190
> [3] https://beam.apache.org/contribute/
> [4] https://issues.apache.org/jira/browse/BEAM-4046
>
>
>>
>> On Sun, Mar 10, 2019 at 7:05 PM Michael Luckey 
>> wrote:
>>
>>> Hi,
>>>
>>> while fiddling with Beam's release process and trying to get it working
>>> on my machine, I again stumbled over 'gradlew build'.
>>>
>>> Till now I have been unable to get a full build to run successfully on my
>>> machine. It keeps failing. After setting up a clean docker instance and trying 

Re: [VOTE] Release 2.11.0, release candidate #2

2019-03-11 Thread Kenneth Knowles
That is a good idea. The blog post is probably the main avenue where folks
will find out about new features or bug fixes.

When I did 2.10.0 I just used the automated Jira release notes and pulled
out significant things based on my judgment. I would also suggest that our
Jira hygiene could be significantly improved to make this process more
effective.
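That triage step can be sketched in a few lines. This is only an illustration, not part of the actual release tooling: the issue dicts below are invented, and a real run would pull them from Jira's REST search API for the release's fixVersion.

```python
# Given issues pulled from the automated Jira release notes, keep only the
# ones likely to matter to users and draft blog bullets from them.
NOTABLE_TYPES = {"New Feature", "Improvement"}

def draft_release_bullets(issues, fix_version):
    """Map notable issues for a fixVersion to blog-post bullet lines."""
    bullets = []
    for issue in issues:
        if fix_version not in issue["fixVersions"]:
            continue  # wrong release; the issue's fixVersion likely needs fixing
        if issue["type"] in NOTABLE_TYPES:
            bullets.append("* [{key}] {summary}".format(**issue))
    return bullets

# Invented sample data for illustration.
issues = [
    {"key": "BEAM-1234", "type": "New Feature",
     "summary": "Example user-facing feature", "fixVersions": ["2.11.0"]},
    {"key": "BEAM-5678", "type": "Task",
     "summary": "Internal cleanup", "fixVersions": ["2.11.0"]},
]
print("\n".join(draft_release_bullets(issues, "2.11.0")))
```

The "judgment" part stays manual; a filter like this just narrows the list the release manager has to read.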

Kenn

On Mon, Mar 11, 2019 at 1:04 PM Thomas Weise  wrote:

> Ahmet, thanks for managing the release!
>
> I have a suggestion (not specific to only this release):
>
> The release blogs could be more useful to users. In this case, we have a
> long list of dependency updates at the top, but the improvements and
> features section should probably come first. I was also very surprised to find
> "Portable Flink runner support for running cross-language transforms."
> mentioned, since that is only being worked on now. On the other hand, there
> are probably items that we miss.
>
> Since this can only be addressed by more eyes, I suggest that going
> forward the blog pull request is included and reviewed as part of the
> release vote.
>
> Also, we should make announcing the release on Twitter part of the process.
>
> Thanks,
> Thomas
>
>
> On Mon, Mar 11, 2019 at 10:46 AM Ahmet Altay  wrote:
>
>> I updated the JIRAs for these two PRs to set the fix version correctly as
>> 2.12.0. That should fix the release notes issue.
>>
>> On Mon, Mar 11, 2019 at 10:44 AM Ahmet Altay  wrote:
>>
>>> Hi Etienne,
>>>
>>> I cut the release branch on 2/14 at [1] (on Feb 14, 2019, 3:52 PM PST --
>>> github timestamp). Release tag, as you pointed out, points to a commit on
>>> Feb 25, 2019 11:48 PM PST. And that is a commit on the release branch.
>>>
>>> After cutting the release branch, I only merged cherry picks from master
>>> to the release branch if a JIRA was tagged as a release blocker and there
>>> was a PR to fix that specific issue. In case of these two PRs, they were
>>> merged at Feb 20 and Feb 18 respectively. They were not included in the
>>> branch cut and I did not cherry pick them either. I apologize if I missed a
>>> request to cherry pick these PRs.
>>>
>>> Does this answer your question?
>>>
>>> Ahmet
>>>
>>> [1]
>>> https://github.com/apache/beam/commit/a103edafba569b2fd185b79adffd91aaacb790f0
>>>
>>> On Mon, Mar 11, 2019 at 1:50 AM Etienne Chauchot 
>>> wrote:
>>>
 @Ahmet sorry I did not have time to check the 2.11 release, but a fellow
 Beam contributor drew my attention to something:

 the 2.11 release tag points to a commit of 02/26, and this [1] PR was
 merged 02/20 and that [2] PR was merged on 02/18. So, both commits should
 be in the released code, but they are not.

 [1] https://github.com/apache/beam/pull/7348
 [2] https://github.com/apache/beam/pull/7751

 So at least for those 2 features the release notes do not match the
 content of the release. Is there a real problem, or did I miss something?

 Etienne

 Le lundi 04 mars 2019 à 11:42 -0800, Ahmet Altay a écrit :

 Thank you for the additional votes and validations.

 Update: Binaries are pushed. Website updates are blocked on an issue
 that is preventing beam-site changes from being synced to the beam website.
 (INFRA-17953). I am waiting for that to be resolved before sending an
 announcement.

 On Mon, Mar 4, 2019 at 3:00 AM Robert Bradshaw 
 wrote:

 I see the vote has passed, but +1 (binding) from me as well.

 On Mon, Mar 4, 2019 at 11:51 AM Jean-Baptiste Onofré 
 wrote:
 >
 > +1 (binding)
 >
 > Tested with beam-samples.
 >
 > Regards
 > JB
 >
 > On 26/02/2019 10:40, Ahmet Altay wrote:
 > > Hi everyone,
 > >
 > > Please review and vote on the release candidate #2 for the version
 > > 2.11.0, as follows:
 > >
 > > [ ] +1, Approve the release
 > > [ ] -1, Do not approve the release (please provide specific
 comments)
 > >
 > > The complete staging area is available for your review, which
 includes:
 > > * JIRA release notes [1],
 > > * the official Apache source release to be deployed to
 dist.apache.org
 > >  [2], which is signed with the key with
 > > fingerprint 64B84A5AD91F9C20F5E9D9A7D62E71416096FA00 [3],
 > > * all artifacts to be deployed to the Maven Central Repository [4],
 > > * source code tag "v2.11.0-RC2" [5],
 > > * website pull request listing the release [6] and publishing the
 API
 > > reference manual [7].
 > > * Python artifacts are deployed along with the source release to the
 > > dist.apache.org  [2].
 > > * Validation sheet with a tab for 2.11.0 release to help with
 validation
 > > [8].
 > >
 > > The vote will be open for at least 72 hours. It is adopted by
 majority
 > > approval, with at least 3 PMC affirmative votes.
 > >
 > > Thanks,
 > > Ahmet
 > >

Re: [VOTE] Release 2.11.0, release candidate #2

2019-03-11 Thread Thomas Weise
Ahmet, thanks for managing the release!

I have a suggestion (not specific to only this release):

The release blogs could be more useful to users. In this case, we have a
long list of dependency updates at the top, but the improvements and
features section should probably come first. I was also very surprised to find
"Portable Flink runner support for running cross-language transforms."
mentioned, since that is only being worked on now. On the other hand, there
are probably items that we miss.

Since this can only be addressed by more eyes, I suggest that going forward
the blog pull request is included and reviewed as part of the release vote.

Also, we should make announcing the release on Twitter part of the process.

Thanks,
Thomas


On Mon, Mar 11, 2019 at 10:46 AM Ahmet Altay  wrote:

> I updated the JIRAs for these two PRs to set the fix version correctly as
> 2.12.0. That should fix the release notes issue.
>
> On Mon, Mar 11, 2019 at 10:44 AM Ahmet Altay  wrote:
>
>> Hi Etienne,
>>
>> I cut the release branch on 2/14 at [1] (on Feb 14, 2019, 3:52 PM PST --
>> github timestamp). Release tag, as you pointed out, points to a commit on
>> Feb 25, 2019 11:48 PM PST. And that is a commit on the release branch.
>>
>> After cutting the release branch, I only merged cherry picks from master
>> to the release branch if a JIRA was tagged as a release blocker and there
>> was a PR to fix that specific issue. In case of these two PRs, they were
>> merged at Feb 20 and Feb 18 respectively. They were not included in the
>> branch cut and I did not cherry pick them either. I apologize if I missed a
>> request to cherry pick these PRs.
>>
>> Does this answer your question?
>>
>> Ahmet
>>
>> [1]
>> https://github.com/apache/beam/commit/a103edafba569b2fd185b79adffd91aaacb790f0
>>
>> On Mon, Mar 11, 2019 at 1:50 AM Etienne Chauchot 
>> wrote:
>>
>>> @Ahmet sorry I did not have time to check the 2.11 release, but a fellow
>>> Beam contributor drew my attention to something:
>>>
>>> the 2.11 release tag points to a commit of 02/26, and this [1] PR was
>>> merged 02/20 and that [2] PR was merged on 02/18. So, both commits should
>>> be in the released code, but they are not.
>>>
>>> [1] https://github.com/apache/beam/pull/7348
>>> [2] https://github.com/apache/beam/pull/7751
>>>
>>> So at least for those 2 features the release notes do not match the
>>> content of the release. Is there a real problem, or did I miss something?
>>>
>>> Etienne
>>>
>>> Le lundi 04 mars 2019 à 11:42 -0800, Ahmet Altay a écrit :
>>>
>>> Thank you for the additional votes and validations.
>>>
>>> Update: Binaries are pushed. Website updates are blocked on an issue
>>> that is preventing beam-site changes from being synced to the beam website.
>>> (INFRA-17953). I am waiting for that to be resolved before sending an
>>> announcement.
>>>
>>> On Mon, Mar 4, 2019 at 3:00 AM Robert Bradshaw 
>>> wrote:
>>>
>>> I see the vote has passed, but +1 (binding) from me as well.
>>>
>>> On Mon, Mar 4, 2019 at 11:51 AM Jean-Baptiste Onofré 
>>> wrote:
>>> >
>>> > +1 (binding)
>>> >
>>> > Tested with beam-samples.
>>> >
>>> > Regards
>>> > JB
>>> >
>>> > On 26/02/2019 10:40, Ahmet Altay wrote:
>>> > > Hi everyone,
>>> > >
>>> > > Please review and vote on the release candidate #2 for the version
>>> > > 2.11.0, as follows:
>>> > >
>>> > > [ ] +1, Approve the release
>>> > > [ ] -1, Do not approve the release (please provide specific comments)
>>> > >
>>> > > The complete staging area is available for your review, which
>>> includes:
>>> > > * JIRA release notes [1],
>>> > > * the official Apache source release to be deployed to
>>> dist.apache.org
>>> > >  [2], which is signed with the key with
>>> > > fingerprint 64B84A5AD91F9C20F5E9D9A7D62E71416096FA00 [3],
>>> > > * all artifacts to be deployed to the Maven Central Repository [4],
>>> > > * source code tag "v2.11.0-RC2" [5],
>>> > > * website pull request listing the release [6] and publishing the API
>>> > > reference manual [7].
>>> > > * Python artifacts are deployed along with the source release to the
>>> > > dist.apache.org  [2].
>>> > > * Validation sheet with a tab for 2.11.0 release to help with
>>> validation
>>> > > [8].
>>> > >
>>> > > The vote will be open for at least 72 hours. It is adopted by
>>> majority
>>> > > approval, with at least 3 PMC affirmative votes.
>>> > >
>>> > > Thanks,
>>> > > Ahmet
>>> > >
>>> > > [1]
>>> > >
>>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12344775
>>> > > [2] https://dist.apache.org/repos/dist/dev/beam/2.11.0/
>>> > > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>> > > [4]
>>> https://repository.apache.org/content/repositories/orgapachebeam-1064/
>>> > > [5] https://github.com/apache/beam/tree/v2.11.0-RC2
>>> > > [6] https://github.com/apache/beam/pull/7924
>>> > > [7] https://github.com/apache/beam-site/pull/587
>>> > > [8]
>>> > >
>>> 

Re: New contributor to Beam

2019-03-11 Thread Melissa Pashniak
Welcome!


On Mon, Mar 11, 2019 at 12:16 PM Suneel Marthi 
wrote:

> Welcome Aizhamal
>
> Sent from my iPhone
>
> On Mar 11, 2019, at 2:08 PM, Rose Nguyen  wrote:
>
> Welcome, Aizhamal!
>
> On Mon, Mar 11, 2019 at 10:55 AM Ahmet Altay  wrote:
>
>> Welcome!
>>
>> On Fri, Mar 8, 2019 at 3:25 PM Ismaël Mejía  wrote:
>>
>>> Done, welcome !
>>>
>>> On Fri, Mar 8, 2019 at 11:03 PM Aizhamal Nurmamat kyzy
>>>  wrote:
>>> >
>>> > Hello everyone!
>>> >
>>> > My name is Aizhamal and I would like to start contributing to Beam.
>>> Can anyone add me as a contributor for Beam's Jira issue tracker? I would
>>> like to create and assign tickets.
>>> >
>>> > My jira username is aizhamal.
>>> >
>>> > Thanks and excited to be part of this community!
>>> > Aizhamal
>>>
>>
>
> --
> Rose Thị Nguyễn
>
>


Re: New contributor to Beam

2019-03-11 Thread Connell O'Callaghan
Hi Aizhamal welcome!!

On Mon, Mar 11, 2019 at 11:08 AM Rose Nguyen  wrote:

> Welcome, Aizhamal!
>
> On Mon, Mar 11, 2019 at 10:55 AM Ahmet Altay  wrote:
>
>> Welcome!
>>
>> On Fri, Mar 8, 2019 at 3:25 PM Ismaël Mejía  wrote:
>>
>>> Done, welcome !
>>>
>>> On Fri, Mar 8, 2019 at 11:03 PM Aizhamal Nurmamat kyzy
>>>  wrote:
>>> >
>>> > Hello everyone!
>>> >
>>> > My name is Aizhamal and I would like to start contributing to Beam.
>>> Can anyone add me as a contributor for Beam's Jira issue tracker? I would
>>> like to create and assign tickets.
>>> >
>>> > My jira username is aizhamal.
>>> >
>>> > Thanks and excited to be part of this community!
>>> > Aizhamal
>>>
>>
>
> --
> Rose Thị Nguyễn
>


Re: New contributor to Beam

2019-03-11 Thread Rose Nguyen
Welcome, Aizhamal!

On Mon, Mar 11, 2019 at 10:55 AM Ahmet Altay  wrote:

> Welcome!
>
> On Fri, Mar 8, 2019 at 3:25 PM Ismaël Mejía  wrote:
>
>> Done, welcome !
>>
>> On Fri, Mar 8, 2019 at 11:03 PM Aizhamal Nurmamat kyzy
>>  wrote:
>> >
>> > Hello everyone!
>> >
>> > My name is Aizhamal and I would like to start contributing to Beam. Can
>> anyone add me as a contributor for Beam's Jira issue tracker? I would like
>> to create and assign tickets.
>> >
>> > My jira username is aizhamal.
>> >
>> > Thanks and excited to be part of this community!
>> > Aizhamal
>>
>

-- 
Rose Thị Nguyễn


Re: Cross-language transform API

2019-03-11 Thread Maximilian Michels
Thanks for the remarks. Correct, we do not need the static URN at all in 
the payload. We can pass the transform URN with the ExternalTransform as 
part of the ExpansionRequest. So this is sufficient for the Proto:


message ConfigValue {
  string coder_urn = 1;
  bytes payload = 2;
}

message ExternalTransformPayload {
  map<string, ConfigValue> configuration = 1;
}
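To make the shape concrete, here is a rough Python sketch of how such a configuration map could be assembled before being sent in an ExpansionRequest. This is not Beam's actual API: the coder URNs and helper names below are invented for illustration, and the varint coder is simplified to a fixed-width encoding.

```python
# Each named parameter is paired with a coder URN and its encoded bytes,
# mirroring the ConfigValue message above.
import struct

def encode_utf8(value):
    # Hypothetical stand-in for a UTF-8 string coder.
    return value.encode("utf-8")

def encode_int(value):
    # Simplified stand-in for a varint coder (fixed 8 bytes for brevity).
    return struct.pack(">q", value)

# Invented URNs; real ones would be the SDK's standard coder URNs.
CODERS = {
    str: ("urn:example:coder:string_utf8:v1", encode_utf8),
    int: ("urn:example:coder:varint:v1", encode_int),
}

def build_configuration(**params):
    """Map each named parameter to a (coder_urn, payload) pair."""
    config = {}
    for name, value in params.items():
        urn, encode = CODERS[type(value)]
        config[name] = {"coder_urn": urn, "payload": encode(value)}
    return config

config = build_configuration(topic="my-topic", num_shards=3)
```

The receiving side would look up each coder by URN and decode the payload back into a typed value.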


Considering Schemas, I'm not sure they are useful for the scope of the 
PR. I think basic Java Reflection is enough.


Thanks,
Max

On 11.03.19 18:36, Robert Bradshaw wrote:

On Mon, Mar 11, 2019 at 6:05 PM Chamikara Jayalath  wrote:


On Mon, Mar 11, 2019 at 9:27 AM Robert Bradshaw  wrote:


On Mon, Mar 11, 2019 at 4:37 PM Maximilian Michels  wrote:



Just to clarify. What's the reason for including a PROPERTIES enum here instead 
of directly making beam_urn a field of ExternalTransformPayload ?


The URN is supposed to be static. We always use the same URN for this
type of external transform. We probably want an additional identifier to
point to the resource we want to configure.


It does feel odd to not use the URN to specify the transform itself,
and embed the true identity in an inner proto. The notion of
"external" is just how it happens to be invoked in this pipeline, not
part of its intrinsic definition. As we want introspection
capabilities in the service, we should be able to use the URN at a top
level and know what kind of payload it expects. I would also like to
see this kind of information populated for non-extern transforms which
could be good for visibility (substitution, visualization, etc.) for
runners and other pipeline-consuming tools.


Like so:

message ExternalTransformPayload {
enum Enum {
  PROPERTIES = 0
  [(beam_urn) = "beam:external:transform:external_transform:v1"];
}
// A fully-qualified identifier, e.g. Java package + class
string identifier = 1;


I'd rather the identifier have semantic rather than
implementation-specific meaning. e.g. one could imagine multiple
implementations of a given transform that different services could
offer.


// the format may change to map if types are supported
map<string, bytes> parameters = 2;
}

The identifier could also be a URN.


Can we change first version to map ? Otherwise the set of 
transforms we can support/test will be very limited.


How do we do that? Do we define a set of standard coders for supported
types? On the Java side we can look up the coder by extracting the field
from the POJO, but we can't do that in Python.



I'll let Reuven comment on the exact relevance and timelines of the Beam
Schema related work here, but until we have that we can probably support the
standard set of coders that are well defined here:
https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L542

So on the Python side the ExternalTransform can take a list of parameters (of
types that have standard coders) which will be converted to bytes to be sent
over the wire. On the Java side, the corresponding standard coders (which are
determined by introspection of the transform builder's payload POJO) can be
used to convert bytes to objects.


They also need to agree on the field types as well as names, so would
it be map>. I'm not sure the tradeoff
between going further down this road vs. getting schemas up to par in
Python (and, next, Go). And supporting this long term in parallel to
what we come up with schemas.


Hopefully Beam schema work will give us a more generalized way to convert objects across 
languages (for example, Python object -> Python Row + Schema -> Java Row + Schema 
-> Java object). Note that we run into the same issue when data tries to cross SDK 
boundaries when executing cross-language pipelines.


+1, which is another reason I want to accelerate the language
independence of schemas.


Can we re-use some of the Beam schemas-related work/utilities here ?


Yes, that was the plan.


On this note, Reuven, what is the plan (and timeline) for a
language-independent representation of schemas? The crux of the
problem is that the user needs to specify some kind of configuration
(call it C) to construct the transform (call it T). This would be
handled by a TransformBuilder that provides (at least) a mapping
C -> T. (Possibly this interface could be offered on the transform
itself).

The question we are trying to answer here is how to represent C, in
both the source and target language, and on the wire. The idea is that
we could leverage the schema infrastructure such that C could be a
POJO in Java (and perhaps a dict in Python). We would want to extend
Schemas and Row (or perhaps a sub/super/sibling class thereof) to
allow for Coder and UDF-typed fields. (Exactly how to represent UDFs
is still very TBD.) The payload for a external transform using this
format would be the tuple (schema, SchemaCoder(schema).encode(C)). The
goal is to not, yet again, invent a cross-language way of defining a
bag of named, typed parameters (aka fields) with language-idiomatic

Re: New contributor to Beam

2019-03-11 Thread Ahmet Altay
Welcome!

On Fri, Mar 8, 2019 at 3:25 PM Ismaël Mejía  wrote:

> Done, welcome !
>
> On Fri, Mar 8, 2019 at 11:03 PM Aizhamal Nurmamat kyzy
>  wrote:
> >
> > Hello everyone!
> >
> > My name is Aizhamal and I would like to start contributing to Beam. Can
> anyone add me as a contributor for Beam's Jira issue tracker? I would like
> to create and assign tickets.
> >
> > My jira username is aizhamal.
> >
> > Thanks and excited to be part of this community!
> > Aizhamal
>


Re: Python precommit duration is above 1hr

2019-03-11 Thread Mikhail Gryzykhin
That's cool! Thank you for working on this.

--Mikhail

Have feedback ?


On Mon, Mar 11, 2019 at 10:49 AM Mark Liu  wrote:

> Sorry for missing this thread in my inbox.
>
> Yes, I'm actively working on pull/7675
>  which works pretty well and is
> under review. At first, I tried detox but the test console output is all
> mixed together, which makes debugging extremely hard. We also lost many
> advantages of Gradle and the scan UI with detox.
>
> pull/7675 uses Gradle
> parallelism and runs tox tasks in there.
> https://scans.gradle.com/s/f3fkqqmiosejm is an example run.
>
> Mark
>
> On Sun, Mar 10, 2019 at 8:19 AM Robbe Sneyders 
> wrote:
>
>> Yes, this is largely due to the addition of Python 3 test suites.
>>
>> Running tests in parallel is actively being investigated by +Mark Liu
>>  in this Jira ticket [1] and this PR [2]. We will
>> add other Python 3.6 and 3.7 test suites only to postcommit until then.
>>
>> [1] https://issues.apache.org/jira/browse/BEAM-6527
>> [2] https://github.com/apache/beam/pull/7675
>>
>> Kind regards,
>> Robbe
>>
>> [image: https://ml6.eu] 
>>
>> * Robbe Sneyders*
>>
>> ML6 Gent
>> 
>>
>> M: +32 474 71 31 08
>>
>>
>> On Sat, 9 Mar 2019 at 20:22, Robert Bradshaw  wrote:
>>
>>> Perhaps this is the duplication of all (or at least most) previously
>>> existing tests for running under Python 3. I agree that this is excessive;
>>> we should probably split out Py2, Py3, and the linters into separate
>>>  targets.
>>>
>>> We could look into using detox or retox to parallelize the testing as
>>> well. (The issue last time was suppression of output on timeout, but that
>>> can be worked around by adding timeouts to the individual tox targets.)
>>>
>>> On Fri, Mar 8, 2019 at 11:26 PM Mikhail Gryzykhin 
>>> wrote:
>>>
 Hi everyone,

 Seems that our Python pre-commit times are growing really fast
 
 .

 Did anyone follow the trend or know what the biggest changes are that
 have happened with Python lately?

 I don't see a single jump, but the duration of pre-commits has almost
 doubled since the new year.

 [image: image.png]

 Regards,
 --Mikhail

 Have feedback ?

>>>


Re: Python precommit duration is above 1hr

2019-03-11 Thread Mark Liu
Sorry for missing this thread in my inbox.

Yes, I'm actively working on pull/7675
 which works pretty well and is
under review. At first, I tried detox but the test console output is all
mixed together, which makes debugging extremely hard. We also lost many
advantages of Gradle and the scan UI with detox.

pull/7675 uses Gradle parallelism
and runs tox tasks in there. https://scans.gradle.com/s/f3fkqqmiosejm is an
example run.

Mark

On Sun, Mar 10, 2019 at 8:19 AM Robbe Sneyders 
wrote:

> Yes, this is largely due to the addition of Python 3 test suites.
>
> Running tests in parallel is actively being investigated by +Mark Liu
>  in this Jira ticket [1] and this PR [2]. We will add
> other Python 3.6 and 3.7 test suites only to postcommit until then.
>
> [1] https://issues.apache.org/jira/browse/BEAM-6527
> [2] https://github.com/apache/beam/pull/7675
>
> Kind regards,
> Robbe
>
> [image: https://ml6.eu] 
>
> * Robbe Sneyders*
>
> ML6 Gent
> 
>
> M: +32 474 71 31 08
>
>
> On Sat, 9 Mar 2019 at 20:22, Robert Bradshaw  wrote:
>
>> Perhaps this is the duplication of all (or at least most) previously
>> existing tests for running under Python 3. I agree that this is excessive;
>> we should probably split out Py2, Py3, and the linters into separate
>>  targets.
>>
>> We could look into using detox or retox to parallelize the testing as
>> well. (The issue last time was suppression of output on timeout, but that
>> can be worked around by adding timeouts to the individual tox targets.)
>>
>> On Fri, Mar 8, 2019 at 11:26 PM Mikhail Gryzykhin 
>> wrote:
>>
>>> Hi everyone,
>>>
>>> Seems that our Python pre-commit times are growing really fast
>>> 
>>> .
>>>
>>> Did anyone follow the trend or know what the biggest changes are that
>>> have happened with Python lately?
>>>
>>> I don't see a single jump, but the duration of pre-commits has almost
>>> doubled since the new year.
>>>
>>> [image: image.png]
>>>
>>> Regards,
>>> --Mikhail
>>>
>>> Have feedback ?
>>>
>>


Re: [VOTE] Release 2.11.0, release candidate #2

2019-03-11 Thread Ahmet Altay
I updated the JIRAs for these two PRs to set the fix version correctly as
2.12.0. That should fix the release notes issue.

On Mon, Mar 11, 2019 at 10:44 AM Ahmet Altay  wrote:

> Hi Etienne,
>
> I cut the release branch on 2/14 at [1] (on Feb 14, 2019, 3:52 PM PST --
> github timestamp). Release tag, as you pointed out, points to a commit on
> Feb 25, 2019 11:48 PM PST. And that is a commit on the release branch.
>
> After cutting the release branch, I only merged cherry picks from master
> to the release branch if a JIRA was tagged as a release blocker and there
> was a PR to fix that specific issue. In case of these two PRs, they were
> merged at Feb 20 and Feb 18 respectively. They were not included in the
> branch cut and I did not cherry pick them either. I apologize if I missed a
> request to cherry pick these PRs.
>
> Does this answer your question?
>
> Ahmet
>
> [1]
> https://github.com/apache/beam/commit/a103edafba569b2fd185b79adffd91aaacb790f0
>
> On Mon, Mar 11, 2019 at 1:50 AM Etienne Chauchot 
> wrote:
>
>> @Ahmet sorry I did not have time to check the 2.11 release, but a fellow
>> Beam contributor drew my attention to something:
>>
>> the 2.11 release tag points to a commit of 02/26, and this [1] PR was
>> merged 02/20 and that [2] PR was merged on 02/18. So, both commits should
>> be in the released code, but they are not.
>>
>> [1] https://github.com/apache/beam/pull/7348
>> [2] https://github.com/apache/beam/pull/7751
>>
>> So at least for those 2 features the release notes do not match the
>> content of the release. Is there a real problem, or did I miss something?
>>
>> Etienne
>>
>> Le lundi 04 mars 2019 à 11:42 -0800, Ahmet Altay a écrit :
>>
>> Thank you for the additional votes and validations.
>>
>> Update: Binaries are pushed. Website updates are blocked on an issue that
>> is preventing beam-site changes from being synced to the beam website.
>> (INFRA-17953). I am waiting for that to be resolved before sending an
>> announcement.
>>
>> On Mon, Mar 4, 2019 at 3:00 AM Robert Bradshaw 
>> wrote:
>>
>> I see the vote has passed, but +1 (binding) from me as well.
>>
>> On Mon, Mar 4, 2019 at 11:51 AM Jean-Baptiste Onofré 
>> wrote:
>> >
>> > +1 (binding)
>> >
>> > Tested with beam-samples.
>> >
>> > Regards
>> > JB
>> >
>> > On 26/02/2019 10:40, Ahmet Altay wrote:
>> > > Hi everyone,
>> > >
>> > > Please review and vote on the release candidate #2 for the version
>> > > 2.11.0, as follows:
>> > >
>> > > [ ] +1, Approve the release
>> > > [ ] -1, Do not approve the release (please provide specific comments)
>> > >
>> > > The complete staging area is available for your review, which
>> includes:
>> > > * JIRA release notes [1],
>> > > * the official Apache source release to be deployed to
>> dist.apache.org
>> > >  [2], which is signed with the key with
>> > > fingerprint 64B84A5AD91F9C20F5E9D9A7D62E71416096FA00 [3],
>> > > * all artifacts to be deployed to the Maven Central Repository [4],
>> > > * source code tag "v2.11.0-RC2" [5],
>> > > * website pull request listing the release [6] and publishing the API
>> > > reference manual [7].
>> > > * Python artifacts are deployed along with the source release to the
>> > > dist.apache.org  [2].
>> > > * Validation sheet with a tab for 2.11.0 release to help with
>> validation
>> > > [8].
>> > >
>> > > The vote will be open for at least 72 hours. It is adopted by majority
>> > > approval, with at least 3 PMC affirmative votes.
>> > >
>> > > Thanks,
>> > > Ahmet
>> > >
>> > > [1]
>> > >
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12344775
>> > > [2] https://dist.apache.org/repos/dist/dev/beam/2.11.0/
>> > > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>> > > [4]
>> https://repository.apache.org/content/repositories/orgapachebeam-1064/
>> > > [5] https://github.com/apache/beam/tree/v2.11.0-RC2
>> > > [6] https://github.com/apache/beam/pull/7924
>> > > [7] https://github.com/apache/beam-site/pull/587
>> > > [8]
>> > >
>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=542393513
>> > >
>> >
>> > --
>> > Jean-Baptiste Onofré
>> > jbono...@apache.org
>> > http://blog.nanthrax.net
>> > Talend - http://www.talend.com
>>
>>


Re: [VOTE] Release 2.11.0, release candidate #2

2019-03-11 Thread Ahmet Altay
Hi Etienne,

I cut the release branch on 2/14 at [1] (on Feb 14, 2019, 3:52 PM PST --
github timestamp). Release tag, as you pointed out, points to a commit on
Feb 25, 2019 11:48 PM PST. And that is a commit on the release branch.

After cutting the release branch, I only merged cherry picks from master to
the release branch if a JIRA was tagged as a release blocker and there was
a PR to fix that specific issue. In case of these two PRs, they were merged
at Feb 20 and Feb 18 respectively. They were not included in the branch cut
and I did not cherry pick them either. I apologize if I missed a request to
cherry pick these PRs.

Does this answer your question?

Ahmet

[1]
https://github.com/apache/beam/commit/a103edafba569b2fd185b79adffd91aaacb790f0
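The branch mechanics behind this can be illustrated with a toy ancestor check: a commit merged to master after the branch cut is not reachable from the release tag unless it was cherry-picked onto the release branch. The commit graph below is invented for illustration (only the PR numbers come from the thread), assuming a single blocker cherry-pick after the cut.

```python
# A release tag "contains" exactly the commits reachable from it.
def is_in_release(commit, tag, parents):
    """Walk ancestors of the tag; the commit is released iff reachable."""
    stack, seen = [tag], set()
    while stack:
        c = stack.pop()
        if c == commit:
            return True
        if c in seen:
            continue
        seen.add(c)
        stack.extend(parents.get(c, []))
    return False

# Toy history: branch cut from master, master keeps moving afterwards,
# one blocker fix cherry-picked onto the release branch, tag on top.
parents = {
    "base": [],
    "branch-cut": ["base"],
    "pr-7348-merge": ["branch-cut"],        # merged to master after the cut
    "pr-7751-merge": ["pr-7348-merge"],     # merged to master after the cut
    "blocker-cherry-pick": ["branch-cut"],  # picked onto the release branch
    "v2.11.0-RC2": ["blocker-cherry-pick"],
}
```

Here `is_in_release("pr-7348-merge", "v2.11.0-RC2", parents)` is False even though the merge date precedes the tag date, which is exactly the confusion in the thread: tag and merge timestamps say nothing about reachability.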

On Mon, Mar 11, 2019 at 1:50 AM Etienne Chauchot 
wrote:

> @Ahmet sorry I did not have time to check the 2.11 release, but a fellow
> Beam contributor drew my attention to something:
>
> the 2.11 release tag points to a commit of 02/26, and this [1] PR was
> merged 02/20 and that [2] PR was merged on 02/18. So, both commits should
> be in the released code, but they are not.
>
> [1] https://github.com/apache/beam/pull/7348
> [2] https://github.com/apache/beam/pull/7751
>
> So at least for those 2 features the release notes do not match the
> content of the release. Is there a real problem, or did I miss something?
>
> Etienne
>
> On Monday, March 4, 2019 at 11:42 -0800, Ahmet Altay wrote:
>
> Thank you for the additional votes and validations.
>
> Update: Binaries are pushed. Website updates are blocked on an issue that
> is preventing beam-site changes from being synced to the Beam website
> (INFRA-17953). I am waiting for that to be resolved before sending an
> announcement.
>
> On Mon, Mar 4, 2019 at 3:00 AM Robert Bradshaw 
> wrote:
>
> I see the vote has passed, but +1 (binding) from me as well.
>
> On Mon, Mar 4, 2019 at 11:51 AM Jean-Baptiste Onofré 
> wrote:
> >
> > +1 (binding)
> >
> > Tested with beam-samples.
> >
> > Regards
> > JB
> >
> > On 26/02/2019 10:40, Ahmet Altay wrote:
> > > Hi everyone,
> > >
> > > Please review and vote on the release candidate #2 for the version
> > > 2.11.0, as follows:
> > >
> > > [ ] +1, Approve the release
> > > [ ] -1, Do not approve the release (please provide specific comments)
> > >
> > > The complete staging area is available for your review, which includes:
> > > * JIRA release notes [1],
> > > * the official Apache source release to be deployed to dist.apache.org
> > >  [2], which is signed with the key with
> > > fingerprint 64B84A5AD91F9C20F5E9D9A7D62E71416096FA00 [3],
> > > * all artifacts to be deployed to the Maven Central Repository [4],
> > > * source code tag "v2.11.0-RC2" [5],
> > > * website pull request listing the release [6] and publishing the API
> > > reference manual [7].
> > > * Python artifacts are deployed along with the source release to the
> > > dist.apache.org  [2].
> > > * Validation sheet with a tab for 2.11.0 release to help with
> validation
> > > [8].
> > >
> > > The vote will be open for at least 72 hours. It is adopted by majority
> > > approval, with at least 3 PMC affirmative votes.
> > >
> > > Thanks,
> > > Ahmet
> > >
> > > [1]
> > >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12344775
> > > [2] https://dist.apache.org/repos/dist/dev/beam/2.11.0/
> > > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> > > [4]
> https://repository.apache.org/content/repositories/orgapachebeam-1064/
> > > [5] https://github.com/apache/beam/tree/v2.11.0-RC2
> > > [6] https://github.com/apache/beam/pull/7924
> > > [7] https://github.com/apache/beam-site/pull/587
> > > [8]
> > >
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=542393513
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
>
>


Re: KafkaIO Exactly-Once & Flink Runner

2019-03-11 Thread Maximilian Michels

We cannot reason about correct exactly-once behavior of a transform without 
understanding how state management and fault-tolerance in the runner work.


Generally, we require a transform's writes to be idempotent for 
exactly-once semantics, even with @RequiresStableInput.


In the case of KafkaIO, we have transactions, which means writes cannot 
be idempotent per se. That's why we drop already-committed records by 
recovering the current committed id from Kafka itself: 
https://github.com/apache/beam/blob/99d5d9138acbf9e5b87e7068183c5fd27448043e/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaExactlyOnceSink.java#L300


Beam's state interface is only used to persist the current record id. 
This is necessary to be able to replay the same ids upon restoring a 
failed job.
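To make the mechanism concrete, here is a minimal sketch of that dedup idea (hypothetical names, not the actual KafkaExactlyOnceSink code): record ids come from replayable persisted state, and duplicates are dropped by comparing against the id last committed to the external system.

```python
class FakeStore:
    """Stands in for Kafka: the committed high-water mark survives failures."""

    def __init__(self):
        self.records = {}

    def last_committed_id(self):
        return max(self.records, default=-1)

    def commit(self, record_id, record):
        self.records[record_id] = record


class ExactlyOnceSink:
    def __init__(self, store, restored_next_id=0):
        self.store = store
        # next_id stands in for the Beam state cell; on restore it is reset
        # to the checkpointed value, so replays reuse the same ids.
        self.next_id = restored_next_id
        # Recovered from the external system itself, not from runner state.
        self.committed_id = store.last_committed_id()

    def write(self, record):
        record_id = self.next_id
        self.next_id += 1
        if record_id <= self.committed_id:
            return False  # already committed by a previous attempt; drop
        self.store.commit(record_id, record)
        return True
```

After a simulated failure, a sink restored with the old `next_id` regenerates the same ids, and every id at or below the recovered high-water mark is skipped.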


-Max

On 11.03.19 17:38, Thomas Weise wrote:
We cannot reason about correct exactly-once behavior of a transform 
without understanding how state management and fault-tolerance in the 
runner work.


Max pinged me this link to the Kafka EOS logic [1]. It uses a state 
variable to find out what was already written. That state variable would 
be part of a future Flink checkpoint. If after a failure we revert to 
the previous checkpoint, it won't help to discover/skip duplicates?


The general problem is that we are trying to rely on state in two 
different places to achieve EOS. This blog [2] describes how Kafka 
Streams can provide the exactly-once guarantee, 
by using only Kafka as transactional resource (and committing all 
changes in a single TX). Everything else would require a distributed 
transaction coordinator (expensive) or a retry with duplicate detection 
mechanism in the external system (like check if record/reference was 
already written to Kafka, JDBC etc. or for file system, check if the 
file that would result from atomic rename already exists).
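The "retry with duplicate detection" option can be sketched as follows (hypothetical names, not a Beam API): before every attempt, check whether the record's unique reference already exists in the external system, so a retry after a lost acknowledgement cannot produce a duplicate.

```python
class FlakyStore:
    """External system whose first acknowledgement is lost after the write lands."""

    def __init__(self):
        self.data = {}
        self.drop_next_ack = True

    def exists(self, ref):
        return ref in self.data

    def put(self, ref, payload):
        self.data[ref] = payload  # the write itself succeeds...
        if self.drop_next_ack:
            self.drop_next_ack = False
            raise IOError("ack lost")  # ...but the caller never learns it


def write_with_dedup(store, ref, payload, attempts=3):
    for _ in range(attempts):
        if store.exists(ref):
            return "skipped"  # a prior attempt already landed
        try:
            store.put(ref, payload)
            return "written"
        except IOError:
            continue  # safe to retry thanks to the existence check
    raise RuntimeError("gave up after %d attempts" % attempts)
```

The existence check is what makes the retry safe: even when the first acknowledgement is lost after the write landed, the second attempt detects the record and skips it instead of duplicating it.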


Thomas

[1] 
https://github.com/apache/beam/blob/99d5d9138acbf9e5b87e7068183c5fd27448043e/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaExactlyOnceSink.java#L329 

[2] 
https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/


On Mon, Mar 11, 2019 at 7:54 AM Maximilian Michels wrote:


This is not really about barriers, those are an implementation detail.

If a transform is annotated with @RequiresStableInput, no data will be
processed by this transform until a complete checkpoint has been taken.
After checkpoint completion, the elements will be processed. In case of
any failures, the checkpoint will be restored and the elements will be
processed again. This requires idempotent writes. KafkaIO's EOS mode
does that by ignoring all elements which are already part of a commit.

-Max

On 11.03.19 15:15, Thomas Weise wrote:
 > So all records between 2 checkpoint barriers will be buffered and on
 > checkpoint complete notification sent in a single transaction to
Kafka?
 >
 > The next question then is what happens if the Kafka transaction
does not
 > complete (and checkpoint complete callback fails)? Will the
callback be
 > repeated after Flink recovers?
 >
 >
 > On Mon, Mar 11, 2019 at 3:02 AM Maximilian Michels wrote:
 >
 >      > But there is still the possibility that we fail to flush the
 >     buffer after the checkpoint is complete (data loss)?
 >
 >     Since we have already checkpointed the buffered data we can retry
 >     flushing it in case of failures. We may emit elements
multiple times
 >     but
 >     that is because the Kafka EOS sink will skip records which
are already
 >     part of a committed transaction.
 >
 >     -Max
 >
 >     On 06.03.19 19:28, Thomas Weise wrote:
 >      > A fair amount of work for truly exactly-once output
was done in
 >      > Apex. Different from almost exactly-once :)
 >      >
 >      > The takeaway was that the mechanism to achieve it depends
on the
 >      > external system. The implementation looks different for
let's say
 >     a file
 >      > sink or JDBC or Kafka.
 >      >
 >      > Apex had an exactly-once producer before Kafka supported
 >     transactions.
 >      > That producer relied on the ability to discover what was
already
 >     written
 >      > to Kafka upon recovery from failure. Why?
 >      >
 >      > Runners are not distributed transaction coordinators and no
 >     matter how
 >      > we write the code, there is always the small possibility
that one
 >     of two
 >      > resources fails to commit, resulting in either data loss or
 >     duplicates.
 >      > The Kafka 

Re: Cross-language transform API

2019-03-11 Thread Robert Bradshaw
On Mon, Mar 11, 2019 at 4:37 PM Maximilian Michels  wrote:
>
> > Just to clarify. What's the reason for including a PROPERTIES enum here 
> > instead of directly making beam_urn a field of ExternalTransformPayload ?
>
> The URN is supposed to be static. We always use the same URN for this
> type of external transform. We probably want an additional identifier to
> point to the resource we want to configure.

It does feel odd to not use the URN to specify the transform itself,
and embed the true identity in an inner proto. The notion of
"external" is just how it happens to be invoked in this pipeline, not
part of its intrinsic definition. As we want introspection
capabilities in the service, we should be able to use the URN at the top
level and know what kind of payload it expects. I would also like to
see this kind of information populated for non-external transforms, which
could be good for visibility (substitution, visualization, etc.) for
runners and other pipeline-consuming tools.

> Like so:
>
> message ExternalTransformPayload {
>   enum Enum {
>     PROPERTIES = 0
>       [(beam_urn) = "beam:external:transform:external_transform:v1"];
>   }
>   // A fully-qualified identifier, e.g. Java package + class
>   string identifier = 1;

I'd rather the identifier have semantic rather than
implementation-specific meaning. e.g. one could imagine multiple
implementations of a given transform that different services could
offer.

>   // the format may change to map if types are supported
>   map parameters = 2;
> }
>
> The identifier could also be a URN.
>
> > Can we change first version to map ? Otherwise the set of 
> > transforms we can support/test will be very limited.
>
> How do we do that? Do we define a set of standard coders for supported
> types? On the Java side we can look up the coder by extracting the field
> from the POJO, but we can't do that in Python.
>
> > Can we re-use some of the Beam schemas-related work/utilities here ?
>
> Yes, that was the plan.

On this note, Reuven, what is the plan (and timeline) for a
language-independent representation of schemas? The crux of the
problem is that the user needs to specify some kind of configuration
(call it C) to construct the transform (call it T). This would be
handled by a TransformBuilder that provides (at least) a mapping
C -> T. (Possibly this interface could be offered on the transform
itself).

The question we are trying to answer here is how to represent C, in
both the source and target language, and on the wire. The idea is that
we could leverage the schema infrastructure such that C could be a
POJO in Java (and perhaps a dict in Python). We would want to extend
Schemas and Row (or perhaps a sub/super/sibling class thereof) to
allow for Coder and UDF-typed fields. (Exactly how to represent UDFs
is still very TBD.) The payload for an external transform using this
format would be the tuple (schema, SchemaCoder(schema).encode(C)). The
goal is to not, yet again, invent a cross-language way of defining a
bag of named, typed parameters (aka fields) with language-idiomatic
mappings and some introspection capabilities, and significantly less
heavy-weight than users defining their own protos (plus generating
bindings to all languages).

Does this seem a reasonable use of schemas?
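The (schema, SchemaCoder(schema).encode(C)) idea can be illustrated with a toy encoder (this is only a sketch of the concept, not Beam's actual SchemaCoder or wire format): the payload carries a schema plus the configuration C encoded against it, so any SDK that understands the schema can rebuild C as an idiomatic object (a POJO in Java, a dict in Python).

```python
import struct

def encode_config(schema, config):
    """Encode the dict `config` positionally against `schema`,
    a list of (field_name, field_type) pairs."""
    out = b""
    for name, typ in schema:
        if typ == "string":
            raw = config[name].encode("utf-8")
            out += struct.pack(">I", len(raw)) + raw  # length-prefixed UTF-8
        elif typ == "int32":
            out += struct.pack(">i", config[name])    # big-endian 32-bit int
        else:
            raise TypeError("unsupported field type: " + typ)
    return out

def decode_config(schema, payload):
    """Rebuild the configuration dict from the schema and the encoded bytes."""
    config, offset = {}, 0
    for name, typ in schema:
        if typ == "string":
            (length,) = struct.unpack_from(">I", payload, offset)
            offset += 4
            config[name] = payload[offset:offset + length].decode("utf-8")
            offset += length
        elif typ == "int32":
            (config[name],) = struct.unpack_from(">i", payload, offset)
            offset += 4
    return config
```

The point is that neither side needs user-defined protos: the schema travels with the payload, and each language maps the fields onto its own natural representation.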


Re: KafkaIO Exactly-Once & Flink Runner

2019-03-11 Thread Maximilian Michels

This is not really about barriers, those are an implementation detail.

If a transform is annotated with @RequiresStableInput, no data will be 
processed by this transform until a complete checkpoint has been taken. 
After checkpoint completion, the elements will be processed. In case of 
any failures, the checkpoint will be restored and the elements will be 
processed again. This requires idempotent writes. KafkaIO's EOS mode 
does that by ignoring all elements which are already part of a commit.
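The buffering behaviour described above can be sketched as follows (assumed names, not the Flink runner's real classes): input is held back until a checkpoint completes, so a restore replays exactly the checkpointed elements and nothing is processed on unstable input.

```python
class StableInputOperator:
    def __init__(self, process_fn):
        self.process_fn = process_fn
        self.buffer = []   # elements seen since the last checkpoint
        self.pending = []  # checkpointed but not yet processed

    def element(self, value):
        self.buffer.append(value)  # nothing is processed yet

    def snapshot(self):
        # The runner checkpoints the buffered input before processing it.
        self.pending, self.buffer = self.buffer, []
        return list(self.pending)

    def restore(self, state):
        self.pending = list(state)

    def checkpoint_complete(self):
        # Safe to process now: after a failure the same elements come back
        # via restore(), which is why downstream writes must be idempotent.
        for value in self.pending:
            self.process_fn(value)
        self.pending = []
```

A failure between snapshot and completion simply replays the same elements, producing duplicates that the idempotent (e.g. Kafka EOS) sink then skips.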


-Max

On 11.03.19 15:15, Thomas Weise wrote:
So all records between 2 checkpoint barriers will be buffered and on 
checkpoint complete notification sent in a single transaction to Kafka?


The next question then is what happens if the Kafka transaction does not 
complete (and checkpoint complete callback fails)? Will the callback be 
repeated after Flink recovers?



On Mon, Mar 11, 2019 at 3:02 AM Maximilian Michels wrote:


 > But there is still the possibility that we fail to flush the
buffer after the checkpoint is complete (data loss)?

Since we have already checkpointed the buffered data we can retry
flushing it in case of failures. We may emit elements multiple times
but
that is because the Kafka EOS sink will skip records which are already
part of a committed transaction.

-Max

On 06.03.19 19:28, Thomas Weise wrote:
 >  > A fair amount of work for truly exactly-once output was done in
 > Apex. Different from almost exactly-once :)
 >
 > The takeaway was that the mechanism to achieve it depends on the
 > external system. The implementation looks different for let's say
a file
 > sink or JDBC or Kafka.
 >
 > Apex had an exactly-once producer before Kafka supported
transactions.
 > That producer relied on the ability to discover what was already
written
 > to Kafka upon recovery from failure. Why?
 >
 > Runners are not distributed transaction coordinators and no
matter how
 > we write the code, there is always the small possibility that one
of two
 > resources fails to commit, resulting in either data loss or
duplicates.
 > The Kafka EOS was a hybrid of producer and consumer, the consumer
part
 > used during recovery to find out what was already produced
previously.
 >
 >      > Flink and Apex have very similar checkpointing models; that's why
this
 > thread caught my attention. Within the topology/runner,
exactly-once is
 > achieved by replay having the same effect. For sinks, it needs to
rely
 > on the capabilities of the respective system (like atomic rename for
 > file sink, or transaction with metadata table for JDBC).
 >
 > The buffering until checkpoint is complete is a mechanism to get
away
 > from sink specific implementations. It comes with the latency
penalty
 > (memory overhead could be solved with a write ahead log). But
there is
 > still the possibility that we fail to flush the buffer after the
 > checkpoint is complete (data loss)?
 >
 > Thanks,
 > Thomas
 >
 >
 >  > On Wed, Mar 6, 2019 at 9:37 AM Kenneth Knowles wrote:
 >
 >     On Tue, Mar 5, 2019 at 10:02 PM Raghu Angadi wrote:
 >
 >
 >
 >         On Tue, Mar 5, 2019 at 7:46 AM Reuven Lax wrote:
 >
 >             RE: Kenn's suggestion. i think Raghu looked into
something
 >             that, and something about it didn't work. I don't
remember
 >             all the details, but I think there might have been some
 >             subtle problem with it that wasn't obvious. Doesn't mean
 >             that there isn't another way to solve that issue.'
 >
 >
 >         Two disadvantages:
 >         - Transactions in Kafka are tied to a single producer
instance.
 >         There is no official API to start a txn in one process and
 >         access it in another process. Flink's sink uses an
internal REST
 >         API for this.
 >
 >
 >     Can you say more about how this works?
 >
 >         - There is one failure case that I mentioned earlier: if
closing
 >         the transaction in downstream transform fails, it is data
loss,
 >         there is no way to replay the upstream transform that
wrote the
 >         records to Kafka.
 >
 >
 >     With coupling of unrelated failures due to fusion, this is a
severe
 >     problem. I think I see now how 2PC affects this. From my
reading, I
 >     can't see the difference in how Flink works. If the checkpoint
 >     

Re: KafkaIO Exactly-Once & Flink Runner

2019-03-11 Thread Maximilian Michels

Just realized, there was a word missing:

Since we have already checkpointed the buffered data we can retry 
flushing it in case of failures. We may emit elements multiple times but 
that is __fine__ because the Kafka EOS sink will skip records which are 
already part of a committed transaction.


On 11.03.19 11:02, Maximilian Michels wrote:
But there is still the possibility that we fail to flush the buffer 
after the checkpoint is complete (data loss)?


Since we have already checkpointed the buffered data we can retry 
flushing it in case of failures. We may emit elements multiple times but 
that is because the Kafka EOS sink will skip records which are already 
part of a committed transaction.


-Max

On 06.03.19 19:28, Thomas Weise wrote:
A fair amount of work for truly exactly-once output was done in 
Apex. Different from almost exactly-once :)


The takeaway was that the mechanism to achieve it depends on the 
external system. The implementation looks different for let's say a 
file sink or JDBC or Kafka.


Apex had an exactly-once producer before Kafka supported transactions. 
That producer relied on the ability to discover what was already 
written to Kafka upon recovery from failure. Why?


Runners are not distributed transaction coordinators and no matter how 
we write the code, there is always the small possibility that one of 
two resources fails to commit, resulting in either data loss or 
duplicates. The Kafka EOS was a hybrid of producer and consumer, the 
consumer part used during recovery to find out what was already 
produced previously.


Flink and Apex have very similar checkpointing models; that's why this 
thread caught my attention. Within the topology/runner, exactly-once 
is achieved by replay having the same effect. For sinks, it needs to 
rely on the capabilities of the respective system (like atomic rename 
for file sink, or transaction with metadata table for JDBC).


The buffering until checkpoint is complete is a mechanism to get away 
from sink specific implementations. It comes with the latency penalty 
(memory overhead could be solved with a write ahead log). But there is 
still the possibility that we fail to flush the buffer after the 
checkpoint is complete (data loss)?


Thanks,
Thomas


On Wed, Mar 6, 2019 at 9:37 AM Kenneth Knowles wrote:


    On Tue, Mar 5, 2019 at 10:02 PM Raghu Angadi wrote:



    On Tue, Mar 5, 2019 at 7:46 AM Reuven Lax wrote:

    RE: Kenn's suggestion. i think Raghu looked into something
    that, and something about it didn't work. I don't remember
    all the details, but I think there might have been some
    subtle problem with it that wasn't obvious. Doesn't mean
    that there isn't another way to solve that issue.'


    Two disadvantages:
    - Transactions in Kafka are tied to a single producer instance.
    There is no official API to start a txn in one process and
    access it in another process. Flink's sink uses an internal REST
    API for this.


    Can you say more about how this works?

    - There is one failure case that I mentioned earlier: if closing
    the transaction in downstream transform fails, it is data loss,
    there is no way to replay the upstream transform that wrote the
    records to Kafka.


    With coupling of unrelated failures due to fusion, this is a severe
    problem. I think I see now how 2PC affects this. From my reading, I
    can't see the difference in how Flink works. If the checkpoint
    finalization callback that does the Kafka commit fails, does it
    invalidate the checkpoint so the start transaction + write elements
    is retried?

    Kenn


    GBKs don't have major scalability limitations in most runners.
    Extra GBK is fine in practice for such a sink (at least no one
    has complained about it yet, though I don't know real usage
    numbers in practice). Flink's implementation in Beam
    using @RequiresStableInput  does have storage requirements and
    latency costs that increase with checkpoint interval. I think it is
    still just as useful. Good to see @RequiresStableInput support
    added to Flink runner in Max's PR.


    Hopefully we can make that work. Another possibility if we
    can't is to do something special for Flink. Beam allows
    runners to splice out well-known transforms with their own
    implementation. Dataflow already does that for Google Cloud
    Pub/Sub sources/sinks. The Flink runner could splice out the
    Kafka sink with one that uses Flink-specific functionality.
    Ideally this would reuse most of the existing Kafka code

    (maybe we could refactor just the EOS part into something
    that could be subbed out).

    Reuven

    On Tue, Mar 5, 2019 at 2:53 AM 

Re: Signing artefacts during release

2019-03-11 Thread Michael Luckey
Oops, of course. +dev@beam.apache.org 

On Mon, Mar 11, 2019 at 3:53 AM Kenneth Knowles  wrote:

> Did you mean to reply-all to the dev@ list too?
>
> On Sun, Mar 10, 2019 at 6:50 PM Michael Luckey 
> wrote:
>
>> Thanks, @Ahmet Altay. We need to look into those
>> issues.
>>
>> Opened PR [1], which should enable releasing with gradle 5. Also stumbled
>> upon usage of gradle release plugin [2] and version management [3]. Both of
>> them were somehow part of discussion on mailing list [4]. Not sure about
>> progress here, @Kenneth Knowles 
>>
>> [1] https://github.com/apache/beam/pull/8026
>> [2] https://issues.apache.org/jira/browse/BEAM-6798
>> [3] https://issues.apache.org/jira/browse/BEAM-6799
>> [4]
>> https://lists.apache.org/thread.html/205472bdaf3c2c5876533750d417c19b0d1078131a3dc04916082ce8@%3Cdev.beam.apache.org%3E
>>
>> On Sat, Mar 9, 2019 at 2:23 AM Ahmet Altay  wrote:
>>
>>>
>>>
>>> On Fri, Mar 8, 2019 at 2:55 AM Robert Bradshaw 
>>> wrote:
>>>
 On Fri, Mar 8, 2019 at 2:42 AM Ahmet Altay  wrote:
 >
 > This sounds good to me.
 >
 > On Thu, Mar 7, 2019 at 3:32 PM Michael Luckey 
 wrote:
 >>
 >> Thanks for your comments.
 >>
 >> So to continue here, I ll prepare a PR implementing C:
 >>
 >> Pass the sign key to the relevant scripts and use that for signing.
 There is something similar already implemented [1]
 >>
 >> We might discuss on that, whether it will work for us or if we need
 to implement something different.
 >>
 >> This should affect at least 'build_release_candidate.sh' and
 'sign_hash_python_wheels.sh'. The release manager is responsible for
 selecting the proper key. Currently there is no 'state passed between the
 scripts', so the release manager will have to specify this repeatedly. This
 could probably be improved later on.
 >
 > This might become a problem. Is it possible for us to tackle this
 sooner than later?

 Requiring a key seems to be a good first step. (Personally, I like to
 be very explicit about what I sign.) Supporting defaults (e.g. in a
 ~/.beam-release config file) is a nice to have.

>>>
>>> +1
>>>
>>>

 >> @Ahmet Altay Could you elaborate which global state you are
 referring to? Is it only that git global configuration of the signing key?
 [2]
 >
 > I was referring to things not related to signing. I do not want to
 digress this thread but briefly I was referring to global installations of
 binaries with sudo and changes to bashrc file. We can work on those
 improvements separately.

 That's really bad. +1 to fixing these (as a separate bug).

>>>
>>> Filed https://issues.apache.org/jira/browse/BEAM-6795 with some
>>> additional information.
>>>
>>


Re: Executing gradlew build

2019-03-11 Thread Michael Luckey
On Mon, Mar 11, 2019 at 3:51 AM Kenneth Knowles  wrote:

> I have never actually tried to run a full build recently. It takes a long
> time and usually isn't what is needed for a particular change. FWIW I view
> Beam at this point as a mini-monorepo, so each directory and target can be
> healthy/unhealthy on its own.
>

Fair Point. Totally agree.


>
> But it does seem like we should at least know what is unhealthy and why.
> Have you been filing Jiras about the failures? Are they predictable? Are
> they targets that pass in Jenkins but not in vanilla build? That would mean
> our Jenkins environment is too rich and should be slimmed down probably.
>
> Kenn
>

Unfortunately those failures are not really predictable. Usually, I start
with plain './gradlew build' and keep adding some '-x
:beam-sdks-python:testPy2Gcp -x :beam-sdks-python:testPython' until build
succeeds. Afterwards it seems possible to remove these exclusions step by
step, thereby filling the build cache, which on subsequent reruns might have
some impact on how tasks are executed.

Most of the failures are Python related. I have not had much success digging
into those. From time to time I see seemingly similar failures on Jenkins,
but tracing Python issues is harder for me, coming from a Java background.
Also, using a Mac, I believe I remember that the preinstalled Python had some
issues/differences compared with private installs. Other failures are the
Elasticsearch tests - which were worked on lately - and the ClickHouse tests,
which still seem to be flaky.

So I mainly blamed my setup and have not yet had the time to track those
failures down further.

But since I used a vanilla system and was not able to get Beam to build, I
got thinking about

1. The release process
The release manager has a lot of stuff to take care of, but is also supposed
to run a full gradle build on her local machine [1]. Apart from being a
long-running task, if it keeps failing this puts an additional burden on the
release manager. So I was wondering why we cannot push that to Jenkins, as
we do with all these other tests [2]. I did not find any existing job doing
this, so I wanted to ask for feedback here.

If a full build has to be run - and of course it makes some sense on
release - I would suggest getting it running on a regular basis on Jenkins,
just to ensure we are not surprised during a release. As a side effect, this
would enable the release manager to delegate this step to Jenkins and free
her time (and dev box).

Until now I have not had the time to investigate whether all these
pre/post/else jobs cover all our tests on all modules. I hoped someone else
could point to the list of jobs which cover all tests.

2. Those newcomers
As a naive newcomer to a project I usually deal with the released jars. When
problems arise, I might download the corresponding sources and try to
investigate. The first thing I usually do here is a make/build/install to
get private builds in. Also, on our contributor site [3] we recommend
ensuring you can run all tests with 'gradlew check', which is not too far
away from a full build. Probably no one would expect a full build to fail
here, which again makes me think we need an equivalent job on Jenkins?

Of course it is also fully reasonable to not support this full build in
line with that mini-monorepo.

On the other hand, after filling the build cache - and enabling more tasks
for caching - a full build should not be too expensive in general. Although
I'd prefer some more granularity, e.g. 'gradlew -p sdks/java build' to
build all of Java, which, if I understand correctly, would be a side effect
of fixing [4].

[1]
https://github.com/apache/beam/blob/master/release/src/main/scripts/verify_release_build.sh#L142
[2]
https://github.com/apache/beam/blob/master/release/src/main/scripts/verify_release_build.sh#L168-L190
[3] https://beam.apache.org/contribute/
[4] https://issues.apache.org/jira/browse/BEAM-4046


>
> On Sun, Mar 10, 2019 at 7:05 PM Michael Luckey 
> wrote:
>
>> Hi,
>>
>> while fiddling with Beam's release process and trying to get it working
>> on my machine, I again stumbled over 'gradlew build'.
>>
>> Until now I am unable to get a full build to run successfully on my machine.
>> It keeps failing. After setting up a clean docker instance and trying to run
>> there, I also failed miserably. Which is kind of expected, as part of the
>> tests rely on docker itself, which probably does not work out of the box.
>> But apart from these expected failures, there were also tons of other
>> failures. I need to set up a full-blown VM and test again, but wanted to ask
>> here first as, after looking into the Jenkins setup, I was also unable to
>> find a job which actually executes a full build.
>>
>> So I am wondering, whether
>> - anyone is actually able to successfully execute a full build
>> (preferably without build cache)
>> - it is intended that we only run a 'gradlew build' during release on the
>> release manager's box
>> - if that full build is somehow implicit in Jenkins setup, and I 

Re: [ANNOUNCE] New committer announcement: Raghu Angadi

2019-03-11 Thread Maximilian Michels

Congrats! :)

On 11.03.19 14:01, Etienne Chauchot wrote:

Congrats ! Well deserved

Etienne

On Monday, March 11, 2019 at 13:22 +0100, Alexey Romanenko wrote:

My congratulations, Raghu!

On 8 Mar 2019, at 10:39, Łukasz Gajowy wrote:


Congratulations! :)

On Fri, Mar 8, 2019 at 10:16 Gleb Kanterov wrote:

Congratulations!

On Thu, Mar 7, 2019 at 11:52 PM Michael Luckey wrote:

Congrats Raghu!

On Thu, Mar 7, 2019 at 8:06 PM Mark Liu wrote:

Congrats!

On Thu, Mar 7, 2019 at 10:45 AM Rui Wang wrote:

Congrats Raghu!


-Rui

On Thu, Mar 7, 2019 at 10:22 AM Thomas Weise wrote:

Congrats!


On Thu, Mar 7, 2019 at 10:11 AM Tim Robertson wrote:

Congrats Raghu

On Thu, Mar 7, 2019 at 7:09 PM Ahmet Altay wrote:

Congratulations!

On Thu, Mar 7, 2019 at 10:08 AM Ruoyun Huang wrote:

Thank you Raghu for your contribution!



On Thu, Mar 7, 2019 at 9:58 AM Connell O'Callaghan wrote:

Congratulation Raghu!!! Thank you for sharing Kenn!!!

On Thu, Mar 7, 2019 at 9:55 AM Ismaël Mejía wrote:

Congrats !

On Thu, Mar 7, 2019 at 17:09, Aizhamal Nurmamat kyzy wrote:

Congratulations, Raghu!!!
On Thu, Mar 7, 2019 at 08:07 Kenneth Knowles wrote:

Hi all,

Please join me and the rest of the Beam PMC in welcoming 
a new committer: Raghu Angadi


Raghu has been contributing to Beam since early 2016! He 
has continuously improved KafkaIO and supported on the 
user@ list but his community contributions are even more 
extensive, including reviews, dev@ list discussions, 
improvements and ideas across SqsIO, FileIO, PubsubIO, 
and the Dataflow and Samza runners. In consideration of 
Raghu's contributions, the Beam PMC trusts Raghu with the 
responsibilities of a Beam committer [1].


Thank you, Raghu, for your contributions.

Kenn

[1] 
https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer 




--

Ruoyun  Huang




--
Cheers,
Gleb




Re: [ANNOUNCE] New committer announcement: Raghu Angadi

2019-03-11 Thread Etienne Chauchot
Congrats ! Well deserved
Etienne
On Monday, March 11, 2019 at 13:22 +0100, Alexey Romanenko wrote:
> My congratulations, Raghu!
> 
> > On 8 Mar 2019, at 10:39, Łukasz Gajowy  wrote:
> > 
> > Congratulations! :)
> > On Fri, Mar 8, 2019 at 10:16 Gleb Kanterov wrote:
> > > Congratulations!
> > > On Thu, Mar 7, 2019 at 11:52 PM Michael Luckey  
> > > wrote:
> > > > Congrats Raghu!
> > > > On Thu, Mar 7, 2019 at 8:06 PM Mark Liu  wrote:
> > > > > Congrats!
> > > > > On Thu, Mar 7, 2019 at 10:45 AM Rui Wang  wrote:
> > > > > > Congrats Raghu!
> > > > > > 
> > > > > > -Rui
> > > > > > On Thu, Mar 7, 2019 at 10:22 AM Thomas Weise  
> > > > > > wrote:
> > > > > > > Congrats!
> > > > > > > 
> > > > > > > On Thu, Mar 7, 2019 at 10:11 AM Tim Robertson 
> > > > > > >  wrote:
> > > > > > > > Congrats Raghu
> > > > > > > > 
> > > > > > > > On Thu, Mar 7, 2019 at 7:09 PM Ahmet Altay  
> > > > > > > > wrote:
> > > > > > > > > Congratulations!
> > > > > > > > > 
> > > > > > > > > On Thu, Mar 7, 2019 at 10:08 AM Ruoyun Huang 
> > > > > > > > >  wrote:
> > > > > > > > > > Thank you Raghu for your contribution! 
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > On Thu, Mar 7, 2019 at 9:58 AM Connell O'Callaghan 
> > > > > > > > > >  wrote:
> > > > > > > > > > > Congratulation Raghu!!! Thank you for sharing Kenn!!! 
> > > > > > > > > > > On Thu, Mar 7, 2019 at 9:55 AM Ismaël Mejía 
> > > > > > > > > > >  wrote:
> > > > > > > > > > > > Congrats !
> > > > > > > > > > > > On Thu, Mar 7, 2019 at 17:09, Aizhamal Nurmamat kyzy 
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > Congratulations, Raghu!!!
> > > > > > > > > > > > > On Thu, Mar 7, 2019 at 08:07 Kenneth Knowles 
> > > > > > > > > > > > >  wrote:
> > > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > > > Please join me and the rest of the Beam PMC in 
> > > > > > > > > > > > > > welcoming a new committer: Raghu Angadi
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Raghu has been contributing to Beam since early 
> > > > > > > > > > > > > > 2016! He has continuously improved KafkaIO
> > > > > > > > > > > > > > and supported it on the user@ list, but his community 
> > > > > > > > > > > > > > contributions are even more extensive,
> > > > > > > > > > > > > > including reviews, dev@ list discussions, 
> > > > > > > > > > > > > > improvements and ideas across SqsIO, FileIO,
> > > > > > > > > > > > > > PubsubIO, and the Dataflow and Samza runners. In 
> > > > > > > > > > > > > > consideration of Raghu's contributions, the
> > > > > > > > > > > > > > Beam PMC trusts Raghu with the responsibilities of 
> > > > > > > > > > > > > > a Beam committer [1].
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Thank you, Raghu, for your contributions.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Kenn
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > [1] 
> > > > > > > > > > > > > > https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
> > > > > > > > > > 
> > > > > > > > > > -- 
> > > > > > > > > > Ruoyun  Huang
> > > > > > > > > > 
> > > > > > > > > > 
> > > 
> > > -- 
> > > Cheers,
> > > Gleb


Re: [ANNOUNCE] New committer announcement: Raghu Angadi

2019-03-11 Thread Alexey Romanenko
My congratulations, Raghu!




Re: KafkaIO Exactly-Once & Flink Runner

2019-03-11 Thread Maximilian Michels

But there is still the possibility that we fail to flush the buffer after the 
checkpoint is complete (data loss)?


Since we have already checkpointed the buffered data, we can retry 
flushing it in case of failures. We may emit elements multiple times, 
but that is fine because the Kafka EOS sink will skip records which are 
already part of a committed transaction.
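To make the retry argument concrete, here is a minimal Python sketch (not the Beam or Kafka API; all names are illustrative) of a sink whose commits are tagged with a checkpoint id, so a replayed flush for an already-committed checkpoint is skipped rather than producing duplicates:

```python
# Conceptual sketch: retrying a flush after recovery is safe when the
# sink records which checkpoint ids it has already committed. The real
# Kafka EOS sink achieves the same effect with committed transactions.

class EosSink:
    def __init__(self):
        self.committed = []           # records visible to consumers
        self.committed_ckpts = set()  # checkpoint ids already committed

    def flush(self, ckpt_id, records):
        if ckpt_id in self.committed_ckpts:
            return  # replayed flush: already committed, skip entirely
        # begin txn ... write ... commit (atomic in the real sink)
        self.committed.extend(records)
        self.committed_ckpts.add(ckpt_id)

sink = EosSink()
sink.flush(7, ["a", "b"])
sink.flush(7, ["a", "b"])  # retry after a failure: no duplicates
```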


-Max

On 06.03.19 19:28, Thomas Weise wrote:
A fair amount of work for true exactly-once output was done in 
Apex. Different from almost-exactly-once :)


The takeaway was that the mechanism to achieve it depends on the 
external system. The implementation looks different for, let's say, a 
file sink, JDBC, or Kafka.


Apex had an exactly-once producer before Kafka supported transactions. 
That producer relied on the ability to discover what was already written 
to Kafka upon recovery from failure. Why?


Runners are not distributed transaction coordinators and no matter how 
we write the code, there is always the small possibility that one of two 
resources fails to commit, resulting in either data loss or duplicates. 
The Kafka EOS was a hybrid of producer and consumer, the consumer part 
used during recovery to find out what was already produced previously.
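A minimal sketch of that hybrid, assuming an in-memory list stands in for a Kafka partition and sequence numbers are embedded alongside the records (illustrative only, not the Apex API): on recovery, the consumer side inspects the topic tail to learn what was already produced, and the producer resumes after that point.

```python
topic = []  # stands in for a Kafka partition

def produce(records, start_seq):
    # each record carries the sequence number it covers
    for i, r in enumerate(records):
        topic.append((start_seq + i, r))

def recover_last_seq():
    # "consumer" part of the hybrid: read back what is already written
    return topic[-1][0] if topic else -1

produce(["a", "b", "c"], start_seq=0)
# crash and restart: replay the input, but skip what was written already
last = recover_last_seq()
replayed = list(enumerate(["a", "b", "c", "d"]))
produce([r for s, r in replayed if s > last], start_seq=last + 1)
```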


Flink and Apex have a very similar checkpointing model; that's why this 
thread caught my attention. Within the topology/runner, exactly-once is 
achieved by replay having the same effect. For sinks, it needs to rely 
on the capabilities of the respective system (like atomic rename for a 
file sink, or a transaction with a metadata table for JDBC).
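The JDBC pattern can be sketched as follows (table and column names are made up for illustration): the output rows and a per-checkpoint marker are written in ONE database transaction, so a replayed checkpoint sees the marker and skips the write.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE output (v TEXT)")
conn.execute("CREATE TABLE committed_ckpt (id INTEGER PRIMARY KEY)")

def exactly_once_write(conn, ckpt_id, rows):
    cur = conn.cursor()
    done = cur.execute(
        "SELECT 1 FROM committed_ckpt WHERE id = ?", (ckpt_id,)).fetchone()
    if done:
        return  # replayed checkpoint: already committed, skip
    cur.executemany("INSERT INTO output (v) VALUES (?)",
                    [(r,) for r in rows])
    cur.execute("INSERT INTO committed_ckpt (id) VALUES (?)", (ckpt_id,))
    conn.commit()  # rows and marker become visible atomically

exactly_once_write(conn, 1, ["a", "b"])
exactly_once_write(conn, 1, ["a", "b"])  # replay: no duplicate rows
```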


The buffering until the checkpoint is complete is a mechanism to get 
away from sink-specific implementations. It comes with a latency 
penalty (the memory overhead could be solved with a write-ahead log). 
But there is still the possibility that we fail to flush the buffer 
after the checkpoint is complete (data loss)?
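A sketch of that buffering mechanism (illustrative structure, not the Flink runner's actual classes): elements are held back until the checkpoint that contains them completes, and because the buffer is part of the checkpointed state, a failed flush can be retried from that state instead of being lost.

```python
class BufferingSink:
    def __init__(self, emit):
        self.buffer = []  # pending elements, included in checkpoint state
        self.emit = emit  # the real, possibly non-idempotent, sink

    def write(self, element):
        self.buffer.append(element)  # nothing visible downstream yet

    def snapshot(self):
        return list(self.buffer)     # buffer travels with the checkpoint

    def on_checkpoint_complete(self):
        # flush; if this fails mid-way, the checkpointed buffer remains
        # and the flush can be retried on recovery
        while self.buffer:
            self.emit(self.buffer.pop(0))

out = []
sink = BufferingSink(out.append)
sink.write("a"); sink.write("b")
state = sink.snapshot()
sink.on_checkpoint_complete()
```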


Thanks,
Thomas


On Wed, Mar 6, 2019 at 9:37 AM Kenneth Knowles wrote:


On Tue, Mar 5, 2019 at 10:02 PM Raghu Angadi wrote:



On Tue, Mar 5, 2019 at 7:46 AM Reuven Lax wrote:

RE: Kenn's suggestion. I think Raghu looked into something like
that, and something about it didn't work. I don't remember
all the details, but I think there might have been some
subtle problem with it that wasn't obvious. That doesn't mean
there isn't another way to solve that issue.


Two disadvantages:
- Transactions in Kafka are tied to a single producer instance.
There is no official API to start a txn in one process and
access it in another process. Flink's sink uses an internal REST
API for this.


Can you say more about how this works?

- There is one failure case that I mentioned earlier: if closing
the transaction in the downstream transform fails, it is data loss;
there is no way to replay the upstream transform that wrote the
records to Kafka.


With the coupling of unrelated failures due to fusion, this is a severe
problem. I think I see now how 2PC affects this. From my reading, I
can't see the difference in how Flink works. If the checkpoint
finalization callback that does the Kafka commit fails, does it
invalidate the checkpoint so that the start-transaction + write-elements
sequence is retried?

Kenn


GBKs don't have major scalability limitations in most runners.
An extra GBK is fine in practice for such a sink (at least no one
has complained about it yet, though I don't know real usage
numbers in practice). Flink's implementation in Beam
using @RequiresStableInput does have storage requirements and
latency costs that increase with the checkpoint interval. I think it
is still just as useful. Good to see @RequiresStableInput support
added to the Flink runner in Max's PR.


Hopefully we can make that work. Another possibility if we
can't is to do something special for Flink. Beam allows
runners to splice out well-known transforms with their own
implementation. Dataflow already does that for Google Cloud
Pub/Sub sources/sinks. The Flink runner could splice out the
Kafka sink with one that uses Flink-specific functionality.
Ideally this would reuse most of the existing Kafka code
(maybe we could refactor just the EOS part into something
that could be subbed out).

Reuven

On Tue, Mar 5, 2019 at 2:53 AM Maximilian Michels wrote:

 > It would be interesting to see if there's something
we could add to the Beam model that would create a
better story for Kafka's EOS writes.

There would have to be a checkpoint-completed callback
the