Re: [ANNOUNCE] New committer: Ahmed Abualsaud

2023-08-27 Thread Reza Rokni via dev
Congrats Ahmed!

On Fri, Aug 25, 2023 at 2:34 PM John Casey via dev 
wrote:

> Congrats Ahmed!
>
> On Fri, Aug 25, 2023 at 10:43 AM Bjorn Pedersen via dev <
> dev@beam.apache.org> wrote:
>
>> Congrats Ahmed! Well deserved!
>>
>> On Fri, Aug 25, 2023 at 10:36 AM Yi Hu via dev 
>> wrote:
>>
>>> Congrats Ahmed!
>>>
>>> On Fri, Aug 25, 2023 at 10:11 AM Ritesh Ghorse via dev <
>>> dev@beam.apache.org> wrote:
>>>
 Congrats Ahmed!

 On Fri, Aug 25, 2023 at 9:53 AM Kerry Donny-Clark via dev <
 dev@beam.apache.org> wrote:

> Well done Ahmed!
>
> On Fri, Aug 25, 2023 at 9:17 AM Danny McCormick via dev <
> dev@beam.apache.org> wrote:
>
>> Congrats Ahmed!
>>
>> On Fri, Aug 25, 2023 at 3:16 AM Jan Lukavský  wrote:
>>
>>> Congrats Ahmed!
>>> On 8/25/23 07:56, Anand Inguva via dev wrote:
>>>
>>> Congratulations Ahmed :)
>>>
>>> On Fri, Aug 25, 2023 at 1:17 AM Damon Douglas <
>>> damondoug...@apache.org> wrote:
>>>
 Well deserved! Congratulations, Ahmed! I'm so happy for you.

 On Thu, Aug 24, 2023, 5:46 PM Byron Ellis via dev <
 dev@beam.apache.org> wrote:

> Congratulations!
>
> On Thu, Aug 24, 2023 at 5:34 PM Robert Burke 
> wrote:
>
>> Congratulations Ahmed!!
>>
>> On Thu, Aug 24, 2023, 4:08 PM Chamikara Jayalath via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Congrats Ahmed!!
>>>
>>> On Thu, Aug 24, 2023 at 4:06 PM Bruno Volpato via dev <
>>> dev@beam.apache.org> wrote:
>>>
 Congratulations, Ahmed!

 Very well deserved!


 On Thu, Aug 24, 2023 at 6:09 PM XQ Hu via dev <
 dev@beam.apache.org> wrote:

> Congratulations, Ahmed!
>
> On Thu, Aug 24, 2023, 5:49 PM Ahmet Altay via dev <
> dev@beam.apache.org> wrote:
>
>> Hi all,
>>
>> Please join me and the rest of the Beam PMC in welcoming a
>> new committer: Ahmed Abualsaud (ahmedabuals...@apache.org).
>>
>> Ahmed has been part of the Beam community since January 2022, working
>> mostly on IO connectors, and he has made a large number of contributions
>> to make Beam IOs more usable, performant, and reliable. At the same time,
>> Ahmed has been active in the user list and at the Beam Summit, helping
>> users by sharing his knowledge.
>>
>> Considering their contributions to the project over this
>> timeframe, the Beam PMC trusts Ahmed with the responsibilities 
>> of a Beam
>> committer. [1]
>>
>> Thank you Ahmed! We are looking forward to seeing more of your
>> contributions!
>>
>> Ahmet, on behalf of the Apache Beam PMC
>>
>> [1]
>>
>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>
>>


Re: [ANNOUNCE] New committer: Damon Douglas

2023-04-25 Thread Reza Rokni via dev
Congratulations!

On Tue, Apr 25, 2023 at 6:52 AM Bjorn Pedersen via dev 
wrote:

> Congrats Damon! Well deserved!
>
> On Tue, Apr 25, 2023 at 9:41 AM John Casey via dev 
> wrote:
>
>> Congrats Damon!
>>
>> On Tue, Apr 25, 2023 at 9:36 AM Yi Hu via dev 
>> wrote:
>>
>>> Congrats Damon!
>>>
>>> On Tue, Apr 25, 2023 at 8:55 AM Ritesh Ghorse via dev <
>>> dev@beam.apache.org> wrote:
>>>
 Congratulations Damon!

 On Tue, Apr 25, 2023 at 12:03 AM Byron Ellis via dev <
 dev@beam.apache.org> wrote:

> Congrats Damon!
>
> On Mon, Apr 24, 2023 at 8:57 PM Austin Bennett 
> wrote:
>
>> thanks for all you do @Damon Douglas  !
>>
>> On Mon, Apr 24, 2023 at 1:00 PM Robert Burke 
>> wrote:
>>
>>> Congratulations Damon!!!
>>>
>>> On Mon, Apr 24, 2023, 12:52 PM Kenneth Knowles 
>>> wrote:
>>>
 Hi all,

 Please join me and the rest of the Beam PMC in welcoming a new
 committer: Damon Douglas (damondoug...@apache.org)

 Damon has contributed widely: Beam Katas, playground,
 infrastructure, and many IO connectors. Damon does lots of code review in
 addition to writing code. (Yes, you can review code as a non-committer!)

 Considering their contributions to the project over this timeframe,
 the Beam PMC trusts Damon with the responsibilities of a Beam 
 committer. [1]

 Thank you Damon! We are looking forward to seeing more of your
 contributions!

 Kenn, on behalf of the Apache Beam PMC

 [1]

 https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer

>>>


Re: [ANNOUNCE] New committer: Anand Inguva

2023-04-25 Thread Reza Rokni via dev
Congratulations!

On Tue, Apr 25, 2023 at 3:48 AM Alexey Romanenko 
wrote:

> Congratulations, Anand! Well deserved!
>
> On 25 Apr 2023, at 06:02, Byron Ellis via dev  wrote:
>
> Congrats Anand!
>
> On Mon, Apr 24, 2023 at 9:54 AM Ahmet Altay via dev 
> wrote:
>
>> Congratulations Anand!
>>
>> On Mon, Apr 24, 2023 at 8:05 AM Kerry Donny-Clark via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Great work Anand, this is well deserved.
>>>
>>>
>>> On Mon, Apr 24, 2023 at 10:35 AM Yi Hu via dev 
>>> wrote:
>>>
 Congrats Anand!

 On Fri, Apr 21, 2023 at 3:54 PM Danielle Syse via dev <
 dev@beam.apache.org> wrote:

> Congratulations!
>
> On Fri, Apr 21, 2023 at 3:53 PM Damon Douglas via dev <
> dev@beam.apache.org> wrote:
>
>> Congratulations Anand!
>>
>> On Fri, Apr 21, 2023 at 12:28 PM Ritesh Ghorse via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Congratulations Anand!
>>>
>>> On Fri, Apr 21, 2023 at 3:24 PM Ahmed Abualsaud via dev <
>>> dev@beam.apache.org> wrote:
>>>
 Congrats Anand!

 On Fri, Apr 21, 2023 at 3:18 PM Anand Inguva via dev <
 dev@beam.apache.org> wrote:

> Thanks everyone. Really excited to be a part of Beam Committers.
>
> On Fri, Apr 21, 2023 at 3:07 PM XQ Hu via dev 
> wrote:
>
>> Congratulations, Anand!!!
>>
>> On Fri, Apr 21, 2023 at 2:31 PM Jack McCluskey via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Congratulations, Anand!
>>>
>>> On Fri, Apr 21, 2023 at 2:28 PM Valentyn Tymofieiev via dev <
>>> dev@beam.apache.org> wrote:
>>>
 Congratulations!

 On Fri, Apr 21, 2023 at 8:19 PM Jan Lukavský 
 wrote:

> Congrats Anand!
> On 4/21/23 20:05, Robert Burke wrote:
>
> Congratulations Anand!
>
> On Fri, Apr 21, 2023, 10:55 AM Danny McCormick via dev <
> dev@beam.apache.org> wrote:
>
>> Woohoo, congrats Anand! This is very well deserved!
>>
>> On Fri, Apr 21, 2023 at 1:54 PM Chamikara Jayalath <
>> chamik...@apache.org> wrote:
>>
>>> Hi all,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming a
>>> new committer: Anand Inguva (ananding...@apache.org)
>>>
>>> Anand has been contributing to Apache Beam for more than a
>>> year and has authored and reviewed more than 100 PRs. Anand has been a core
>>> contributor to Beam Python SDK and drove the efforts to support 
>>> Python 3.10
>>> and Python 3.11.
>>>
>>> Considering their contributions to the project over this
>>> timeframe, the Beam PMC trusts Anand with the responsibilities 
>>> of a Beam
>>> committer. [1]
>>>
>>> Thank you Anand! We are looking forward to seeing more of your
>>> contributions!
>>>
>>> Cham, on behalf of the Apache Beam PMC
>>>
>>> [1]
>>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>>
>>
>


Re: [ANNOUNCE] New PMC Member: Jan Lukavský

2023-02-16 Thread Reza Rokni via dev
Congratulations!

On Thu, Feb 16, 2023 at 7:47 AM Robert Burke  wrote:

> Congratulations!
>
> On Thu, Feb 16, 2023, 7:44 AM Danielle Syse via dev 
> wrote:
>
>> Congrats, Jan! That's awesome news. Thank you for your continued
>> contributions!
>>
>> On Thu, Feb 16, 2023 at 10:42 AM Alexey Romanenko <
>> aromanenko@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming Jan Lukavský <
>>> j...@apache.org> as our newest PMC member.
>>>
>>> Jan has been a part of the Beam community and a long-time contributor since
>>> 2018 in many significant ways, including code contributions in different
>>> areas, participating in technical discussions, advocating for users, giving
>>> a talk at Beam Summit and even writing one of the few Beam books!
>>>
>>> Congratulations Jan and thanks for being a part of Apache Beam!
>>>
>>> ---
>>> Alexey
>>
>>


Re: [ANNOUNCE] New committer: Ning Kang

2021-03-24 Thread Reza Rokni
Congratulations!

On Wed, Mar 24, 2021 at 12:53 PM Kenneth Knowles  wrote:

> Congratulations and thanks for all your contributions, Ning!
>
> Kenn
>
> On Tue, Mar 23, 2021 at 4:10 PM Valentyn Tymofieiev 
> wrote:
>
>> Congratulations, Ning, very well deserved!
>>
>> On Tue, Mar 23, 2021 at 2:01 PM Chamikara Jayalath 
>> wrote:
>>
>>> Congrats Ning!
>>>
>>> On Tue, Mar 23, 2021 at 1:23 PM Rui Wang  wrote:
>>>
 Congrats!



 -Rui

 On Tue, Mar 23, 2021 at 1:05 PM Yichi Zhang  wrote:

> Congratulations Ning!
>
> On Tue, Mar 23, 2021 at 1:00 PM Robin Qiu  wrote:
>
>> Congratulations Ning!
>>
>> On Tue, Mar 23, 2021 at 12:56 PM Ahmet Altay 
>> wrote:
>>
>>> Congratulations Ning!
>>>
>>> On Tue, Mar 23, 2021 at 12:38 PM Alexey Romanenko <
>>> aromanenko@gmail.com> wrote:
>>>
 Congrats, Ning Kang! Well deserved!
 Thank you for your contributions and users support!

 Alexey

 On 23 Mar 2021, at 20:35, Pablo Estrada  wrote:

 Hi all,

 Please join me and the rest of the Beam PMC in welcoming a new
 committer: Ning Kang.

 Ning has been working in Beam for a while. He has contributed to
 the interactive experience of the Python SDK, and developed a sidebar
 component, along with a release process for it. Ning has also helped 
 users
 on StackOverflow and user@, especially when it comes to
 Interactive Beam.

 Considering these contributions, the Beam PMC trusts Ning with the
 responsibilities of a Beam committer.[1]

 Thanks Ning!
 -P.

 [1] https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer





Re: [ANNOUNCE] New committer: Piotr Szuberski

2021-01-26 Thread Reza Rokni
Congrats!

On Tue, Jan 26, 2021 at 4:25 AM Piotr Szuberski 
wrote:

> Thank you everyone! I really don't know what to say. I'm truly honoured. I
> do hope I will be able to keep up with the contributions.
>
> On 2021/01/22 16:32:45, Alexey Romanenko 
> wrote:
> > Hi everyone,
> >
> > Please join me and the rest of the Beam PMC in welcoming a new
> committer: Piotr Szuberski .
> >
> > Piotr started to contribute to Beam about one year ago and has been
> very active since then. He has contributed to different areas, like
> adding cross-language functionality to existing IOs and improving the ITs
> and performance tests environment/runtime, and he actively worked on
> dependency updates [1].
> >
> > In consideration of his contributions, the Beam PMC trusts him with the
> responsibilities of a Beam committer [2].
> >
> > Thank you for your contributions, Piotr!
> >
> > -Alexey, on behalf of the Apache Beam PMC
> >
> > [1]
> https://github.com/apache/beam/pulls?q=is%3Apr+author%3Apiotr-szuberski
> > [2]
> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
> >
> >
> >
>


Re: [ANNOUNCE] New PMC Member: Chamikara Jayalath

2021-01-21 Thread Reza Rokni
Congratulations!

On Fri, Jan 22, 2021 at 6:58 AM Ankur Goenka  wrote:

> Congrats Cham!
>
> On Thu, Jan 21, 2021 at 2:57 PM Ahmet Altay  wrote:
>
>> Hi all,
>>
>> Please join me and the rest of Beam PMC in welcoming Chamikara Jayalath
>> as our
>> newest PMC member.
>>
>> Cham has been part of the Beam community from its early days and
>> contributed to the project in significant ways, including contributing new
>> features and improvements, especially related to Beam IOs, advocating for
>> users, and mentoring new community members.
>>
>> Congratulations Cham! And thanks for being a part of Beam!
>>
>> Ahmet
>>
>


Re: Farewell mail

2020-12-16 Thread Reza Rokni
Thank you Piotr!

On Thu, Dec 17, 2020 at 5:30 AM Rui Wang  wrote:

> Thank you Piotr for all your contributions so far! Wish all the best!
>
>
> -Rui
>
> On Wed, Dec 16, 2020 at 12:36 PM Ismaël Mejía  wrote:
>
>> Thanks Piotr,
>>
>> You made an impact on Beam! Best wishes in the future projects and
>> feel welcome whenever you want to contribute again.
>>
>> Ismaël
>>
>>
>>
>> On Wed, Dec 16, 2020 at 9:02 PM Brian Hulette 
>> wrote:
>> >
>> > Thank you for all your contributions! Good luck in your future
>> endeavors :)
>> >
>> > Brian
>> >
>> > On Wed, Dec 16, 2020 at 9:35 AM Griselda Cuevas 
>> wrote:
>> >>
>> >> Thank you Piotr for your contributions.
>> >>
>> >> On Wed, 16 Dec 2020 at 09:16, Ahmet Altay  wrote:
>> >>>
>> >>> Thank you Piotr and best wishes!
>> >>>
>> >>> On Wed, Dec 16, 2020 at 8:46 AM Alexey Romanenko <
>> aromanenko@gmail.com> wrote:
>> 
>>  Piotr,
>> 
>>  Thanks a lot for your contributions, it was very useful and made
>> Beam more stable and finally even better! And it was always interesting to
>> work with you =)
>> 
>>  I wish you all the best in your next adventure but feel free to get
>> back to Beam and contribute in any way as you can. It is always welcome!
>> 
>>  Alexey
>> 
>>  > On 16 Dec 2020, at 17:16, Piotr Szuberski <
>> piotr.szuber...@polidea.com> wrote:
>>  >
>>  > Hi all,
>>  >
>>  > This week is the last one I'm working on Beam. It was a pleasure
>> to contribute to this project. I've learned a lot and had a really good time
>> with you guys!
>>  >
>>  > The IT world is quite small so there is no goodbye. See you in the
>> future on the Web and in other great projects!
>>  >
>>  > You can find me at https://github.com/piotr-szuberski
>>  >
>>  > Piotr
>> 
>>
>


Re: [PROPOSAL] Preparing for Beam 2.26.0 release

2020-10-27 Thread Reza Rokni
+1 Thanx!

On Wed, Oct 28, 2020 at 7:14 AM Valentyn Tymofieiev 
wrote:

> +1, thanks and good luck!
>
> On Tue, Oct 27, 2020 at 1:59 PM Tyson Hamilton  wrote:
>
>> Thanks Rebo! SGTM.
>>
>> On Tue, Oct 27, 2020 at 11:11 AM Udi Meiri  wrote:
>>
>>> +1 sg!
>>>
>>> On Tue, Oct 27, 2020 at 10:02 AM Robert Burke  wrote:
>>>
 Hello everyone!

 The next Beam release (2.26.0) is scheduled to be cut on November 4th
 according to the release calendar [1].

 I'd like to volunteer myself to handle this release. I plan on cutting
 the
 branch on November 5th (since I've had November 4th booked off for
 months now) and cherry-picking in release-blocking fixes
 afterwards. So unresolved release blocking JIRA issues should have
 their "Fix Version/s" marked as "2.26.0".

 Any comments or objections?

 Thanks,
 Robert Burke
 @lostluck
 [1]
 https://calendar.google.com/calendar/u/0/embed?src=0p73sl034k80oob7seouani...@group.calendar.google.com

>>>


Re: Cleaning up Approximate Algorithms in Beam

2020-10-13 Thread Reza Rokni
Hi,

Sorry it took almost a year before we found time...

https://github.com/apache/beam/pull/12973 (Robin and Andrea have agreed
to review).

With this PR the old ApproximateUnique will be marked as deprecated, with
notes to make use of ApproximateCountDistinct.java
<https://github.com/apache/beam/pull/12973/files#diff-8b3de19b55328b9ca265ec25d57c86e2c59a9c505195146df4af5f9eb73e5fc8>.

Thanx

Reza

On Wed, Nov 27, 2019 at 2:29 AM Robert Bradshaw  wrote:

> I think this thread is sufficient.
>
> On Mon, Nov 25, 2019 at 5:59 PM Reza Rokni  wrote:
>
>> Hi,
>>
>> So do we need a vote for the final list of actions? Or is this thread
>> enough to go ahead and raise the PRs?
>>
>> Cheers
>>
>> Reza
>>
>> On Tue, 26 Nov 2019 at 06:01, Ahmet Altay  wrote:
>>
>>>
>>>
>>> On Mon, Nov 18, 2019 at 10:57 AM Robert Bradshaw 
>>> wrote:
>>>
>>>> On Sun, Nov 17, 2019 at 5:16 PM Reza Rokni  wrote:
>>>>
>>>>> Ahmet: FWIW, there is a python implementation only for this
>>>>> version:
>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38
>>>>> Eventually we will be able to make use of cross language transforms to
>>>>> help with feature parity. Until then, are we ok with marking this
>>>>> deprecated in python, even though we do not have another solution. Or 
>>>>> leave
>>>>> it as is in Python now, as it does not have sketch capability so can only
>>>>> be used for outputting results directly from the pipeline.
>>>>>
>>>>
>>> If it is our intention to add the capability eventually, IMO it makes
>>> sense to mark the existing functionality deprecated in Python as well.
>>>
>>>
>>>>> *Reuven: I think this is the sort of thing that has been experimental
>>>>> forever, and therefore not experimental (e.g. the entire triggering API is
>>>>> experimental as are all our file-based sinks). I think that many users use
>>>>> this, and probably store the state implicitly in streaming pipelines.*
>>>>> True, I have an old action item to try and go through and PR against
>>>>> old @experimental annotations but need to find time. So for this
>>>>> discussion, I guess this should be marked as deprecated if we change it
>>>>> even though it's @experimental.
>>>>>
>>>>
>>>> Agreed.
>>>>
>>>>
>>>>> *Rob: I'm not following this--by naming things after their
>>>>> implementation rather than their intent I think they will be harder to
>>>>> search for. *
>>>>> This is to add the implementation to the name, after the intent. For
>>>>> example, ApproximateCountDistinctZetaSketch should be easy to
>>>>> search for, and it is clear which implementation is used, while allowing
>>>>> for a potentially better implementation of ApproximateCountDistinct.
>>>>>
>>>>
>>>> OK, if we have both I'm more OK with that. This is better than the
>>>> names like HllCount, which seems to be what was suggested.
>>>>
>>>> Another approach would be to have a required  parameter which is an
>>>>> enum of the implementation options.
>>>>> ApproximateCountDistinct.of().usingImpl(ZETA) ?
>>>>>
>>>>
>>>> Ideally this could be an optional parameter, or possibly only required
>>>> during update until we figure out a good way for the runner to plug this in
>>>> appropriately.
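For context on the sketches being discussed: ApproximateCountDistinct-style transforms are typically backed by a HyperLogLog (HLL) sketch, which is the family of algorithms ZetaSketch implements. The toy Python below is purely illustrative of the idea — it is not Beam or ZetaSketch code, and the function name, register count (p=8), hash choice (md5), and corrections are simplified assumptions.

```python
import hashlib
import math


def hll_estimate(items, p=8):
    """Toy HyperLogLog: estimate the number of distinct items.

    p controls accuracy: m = 2**p registers give a relative
    standard error of roughly 1.04 / sqrt(m).
    """
    m = 1 << p
    registers = [0] * m
    for item in items:
        # Deterministic 64-bit hash of the item.
        h = int.from_bytes(hashlib.md5(str(item).encode()).digest()[:8], "big")
        idx = h & (m - 1)  # low p bits pick a register
        w = h >> p         # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in w.
        rank = (64 - p) - w.bit_length() + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        # Small-range correction (linear counting).
        return m * math.log(m / zeros)
    return raw
```

The point of the naming debate above is that the stored state (the register array) is implementation-specific, so the sketch format must be pinned down in the transform's name or a parameter for update compatibility.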
>>>>
>>>> Rob/Kenn: On Combiner discussion, should we tie action items from the
>>>>> needs of this thread to this larger discussion?
>>>>>
>>>>> Cheers
>>>>> Reza
>>>>>
>>>>> On Fri, 15 Nov 2019 at 08:32, Robert Bradshaw 
>>>>> wrote:
>>>>>
>>>>>> On Thu, Nov 14, 2019 at 1:06 AM Kenneth Knowles 
>>>>>> wrote:
>>>>>>
>>>>>>> Wow. Nice summary, yes. Major calls to action:
>>>>>>>
>>>>>>> 0. Never allow a combiner that does not include the format of its
>>>>>>> state clear in its name/URN. The "update compatibility" problem makes

Re: [BEAM-10587] Support Maps in BigQuery #12389

2020-10-11 Thread Reza Rokni
+1 on a configuration parameter to enable this, leaving current behaviour
as the default.

On Sat, Oct 10, 2020 at 12:35 AM Andrew Pilloud  wrote:

> BigQuery has no native support for Map types, but I agree that we should
> be consistent with how other tools import maps into BigQuery. Is this
> something Dataflow templates do? What other tools are there?
>
> Beam ZetaSQL also lacks support for Map types. I like the idea of adding a
> configuration parameter to turn this on and retaining the existing behavior
> by default.
>
> Thanks for sending this to the list!
>
> Andrew
>
> On Fri, Oct 9, 2020 at 7:20 AM Jeff Klukas  wrote:
>
>> It's definitely desirable to be able to get back Map types from BQ, and
>> it's nice that BQ is consistent in representing maps as repeated key/value
>> structs. Inferring maps from that specific structure is preferable to
>> inventing some new naming convention for the fields, which would hinder
>> interoperability with non-Beam applications.
>>
>> Would it be possible to add a configurable parameter called something
>> like withMapsInferred() ? Default behavior would be the status quo, but
>> users could opt in to the behavior of inferring maps based on field names.
>> This would prevent the PR change from potentially breaking existing
>> applications. And it means the least surprising behavior remains the
>> default.
>>
>> On Fri, Oct 9, 2020 at 6:06 AM Worley, Ryan 
>> wrote:
>>
>>> https://github.com/apache/beam/pull/12389
>>>
>>> Hi everyone, in the above pull request I am attempting to add support
>>> for writing Avro records with maps to a BigQuery table (via Beam Schema).
>>> The write portion is fairly straightforward - we convert the map to an
>>> array of structs with key and value fields (seemingly the closest possible
>>> approximation of a map in BigQuery).  But the read back portion is more
>>> controversial because we simply check if a field is an array of structs
>>> with exactly two fields - key and value - and assume that should be read
>>> into a Schema map field.
>>>
>>> So the possibility exists that an array of structs with key and value
>>> fields, which wasn't originally written from a map, could be unexpectedly
>>> read into a map.  In the PR review I suggested a few options for tagging
>>> the BigQuery field, so that we could know it was written from a Beam Schema
>>> map and should be read back into one, but I'm not very satisfied with any
>>> of the options.
>>>
>>> Andrew Pilloud suggested that I write to this group to get some feedback
>>> on the issue.  Should we be concerned that all arrays of structs with
>>> exactly 'key' and 'value' fields would be read into a Schema map or could
>>> this be considered a feature?  If the former, how would you suggest that we
>>> limit reading into a map only those fields that were originally written
>>> from a map?
>>>
>>> Thanks for any feedback to help bump this PR along!
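To make the read-back ambiguity concrete, here is a plain-Python sketch of the scheme described above. This is illustrative only — it is not the BigQueryIO code, and the function names are made up.

```python
def map_to_repeated_struct(mapping):
    """Write path: encode a map as a BigQuery-style repeated struct
    with 'key' and 'value' fields."""
    return [{"key": k, "value": v} for k, v in mapping.items()]


def infer_map(field_value):
    """Read path: treat an array of structs whose fields are exactly
    {'key', 'value'} as a map; otherwise return the value unchanged."""
    if (isinstance(field_value, list) and field_value
            and all(isinstance(e, dict) and set(e) == {"key", "value"}
                    for e in field_value)):
        return {e["key"]: e["value"] for e in field_value}
    return field_value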
>>>
>>>
>>


Re: [ANNOUNCE] New committer: Reza Ardeshir Rokni

2020-09-11 Thread Reza Rokni
Thanx everyone! Looking forward to being able to contribute more :-)

On Sat, Sep 12, 2020 at 4:33 AM Valentyn Tymofieiev 
wrote:

> Congrats!
>
> On Thu, Sep 10, 2020 at 8:08 PM Connell O'Callaghan 
> wrote:
>
>> Excellent- well done Reza!!!
>>
>> On Thu, Sep 10, 2020 at 7:35 PM Austin Bennett <
>> whatwouldausti...@gmail.com> wrote:
>>
>>> Thanks and congrats, Reza!
>>>
>>> On Thu, Sep 10, 2020 at 5:48 PM Heejong Lee  wrote:
>>>
 Congratulations!

 On Thu, Sep 10, 2020 at 4:42 PM Robert Bradshaw 
 wrote:

> Thank you and welcome, Reza!
>
> On Thu, Sep 10, 2020 at 4:00 PM Ahmet Altay  wrote:
>
>> Congratulations Reza! And thank you for your contributions!
>>
>> On Thu, Sep 10, 2020 at 3:59 PM Chamikara Jayalath <
>> chamik...@google.com> wrote:
>>
>>> Congrats Reza!
>>>
>>> On Thu, Sep 10, 2020 at 10:35 AM Kenneth Knowles 
>>> wrote:
>>>
 Hi all,

 Please join me and the rest of the Beam PMC in welcoming a new
 committer: Reza Ardeshir Rokni.

 Reza has been part of the Beam community since 2017! Reza has
 spearheaded advanced Beam examples [1], blogged and presented at 
 multiple
 Beam Summits. Reza helps out users on the mailing lists [2] and
 StackOverflow [3]. When Reza's work uncovers a missing feature in 
 Beam, he
 adds it [4]. Considering these contributions, the Beam PMC trusts Reza 
 with
 the responsibilities of a Beam committer [5].

 Thank you, Reza, for your contributions.

 Kenn

 [1] https://github.com/apache/beam/pull/3961
 [2]
 https://lists.apache.org/list.html?u...@beam.apache.org:gte=0d:reza%20rokni
 [3] https://stackoverflow.com/tags/apache-beam/topusers
 [4] https://github.com/apache/beam/pull/11929
 [5]
 https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer



>>>
>>>
>>
>>
>
>


>>>
>>> --
>> Your feedback welcomed for Connello!!!
>> 
>>
>


Re: [ANNOUNCE] New committer: Heejong Lee

2020-09-10 Thread Reza Rokni
Congratulations!

On Thu, Sep 10, 2020 at 12:20 AM Alexey Romanenko 
wrote:

> Congratulations!
>
> On 9 Sep 2020, at 09:50, Jan Lukavský  wrote:
>
> Congratulations!
> On 9/9/20 1:00 AM, Chamikara Jayalath wrote:
>
> Congrats Heejong!
>
> On Tue, Sep 8, 2020 at 1:55 PM Yichi Zhang  wrote:
>
>> Congratulations Heejong!
>>
>> On Tue, Sep 8, 2020 at 1:42 PM Ankur Goenka  wrote:
>>
>>> Congratulations Heejong!
>>>
>>> On Tue, Sep 8, 2020 at 1:40 PM Udi Meiri  wrote:
>>>
 Congrats Heejong!

 On Tue, Sep 8, 2020 at 12:33 PM Tyson Hamilton 
 wrote:

> Congratulations!
>
> On Tue, Sep 8, 2020, 12:10 PM Robert Bradshaw 
> wrote:
>
>> Congratulations, Heejong!
>>
>> On Tue, Sep 8, 2020 at 11:41 AM Rui Wang  wrote:
>>
>>> Congrats, Heejong!
>>>
>>>
>>> -Rui
>>>
>>> On Tue, Sep 8, 2020 at 11:26 AM Robin Qiu 
>>> wrote:
>>>
 Congrats, Heejong!

 On Tue, Sep 8, 2020 at 11:23 AM Valentyn Tymofieiev <
 valen...@google.com> wrote:

> Congratulations, Heejong!
>
> On Tue, Sep 8, 2020 at 11:14 AM Ahmet Altay 
> wrote:
>
>> Hi everyone,
>>
>> Please join me and the rest of the Beam PMC in welcoming
>> a new committer: Heejong Lee .
>>
>> Heejong has been active in the community for more than 2 years, has
>> worked on various IOs (parquet, kafka, file, pubsub), and most recently
>> worked on adding the cross-language transforms feature to Beam [1].
>>
>> In consideration of his contributions, the Beam PMC trusts him
>> with the responsibilities of a Beam committer [2].
>>
>> Thank you for your contributions Heejong!
>>
>> -Ahmet, on behalf of the Apache Beam PMC
>>
>> [1]
>> https://issues.apache.org/jira/browse/BEAM-10634?jql=project%20%3D%20BEAM%20AND%20assignee%20in%20(heejong)%20ORDER%20BY%20resolved%20DESC%2C%20affectedVersion%20ASC%2C%20priority%20DESC%2C%20updated%20DESC
>> [2]
>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>
>
>


Re: [PROPOSAL] Preparing for Beam 2.25.0 release

2020-09-10 Thread Reza Rokni
Thanx Robin!

On Thu, Sep 10, 2020 at 2:30 AM Ahmet Altay  wrote:

> Thank you Robin!
>
> On Wed, Sep 9, 2020 at 10:23 AM Rui Wang  wrote:
>
>> Thanks Robin for working on this!
>>
>>
>> -Rui
>>
>> On Wed, Sep 9, 2020 at 10:11 AM Robin Qiu  wrote:
>>
>>> Hello everyone,
>>>
>>> The next Beam release (2.25.0) is scheduled to be cut on September 23
>>> according to the release calendar [1].
>>>
>>> I'd like to volunteer myself to handle this release. I plan on cutting
>>> the branch on that date and cherry-picking in release-blocking fixes
>>> afterwards. So unresolved release blocking JIRA issues should have
>>> their "Fix Version/s" marked as "2.25.0".
>>>
>>> Any comments or objections?
>>>
>>> Thanks,
>>> Robin Qiu
>>>
>>> [1]
>>> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
>>>
>>


Re: [VOTE] Release 2.23.0, release candidate #1

2020-07-16 Thread Reza Rokni
Hi,

Are there strong objections to the ability to do patches?

Cheers

Reza

On Fri, Jul 17, 2020 at 9:16 AM Valentyn Tymofieiev 
wrote:

> Thanks for the feedback, help with release validation, and for reaching
> out on dev@ regarding a cherry-pick request.
>
> BEAM-10397 pertains to
> new functionality (xlang support on Dataflow). Are there any reasons that
> this fix cannot wait until 2.24.0 (release cut date 4 weeks from now)?
>
> For transparency, I would like to list other cherry-pick requests that I
> received off-the list (stakeholders bcc'ed):
> - https://github.com/apache/beam/pull/12175
> - https://github.com/apache/beam/pull/12196
> - https://github.com/apache/beam/pull/12171
> - https://issues.apache.org/jira/browse/BEAM-10492 (recently added)
> - https://issues.apache.org/jira/browse/BEAM-10385
> - https://github.com/apache/beam/pull/12187 (was available before any of
> RC1 artifacts were created and integrated)
>
> My response to such requests is guided by the release guide [1]:
>
> - None of the issues were a regression from a previous release.
> - Most are related to new or recently introduced functionality.
> - 3 of the requests are related to xlang io, which is very exciting and
> important functionality, but arguably does not impact a large percentage of
> [existing] users.
>
> So they do not seem to be release-blocking according to the guide.
>
> At this point creating a new RC would delay 2.23.0 availability by at
> least a week. While a new RC will improve the stability of xlang IO, it
> will also delay the release of  features and bug fixes available in 2.23.0.
> It will also create a precedent of inconsistency with release
> policy. Should we delay the release if we discover another xlang issue
> during validation next week?
>
> My preferred course of action is to continue with RC1, since release
> velocity is important for product health.
>
> Given that we are having this conversation, we can revise the cherry-pick
> policy if we think it does not adequately cover this situation.
>
> We can also propose a patch-version release  with urgent cherry-picks
> (release 2.23.1), or consider a faster release cadence if 6 weeks is too
> slow.
>
> Thanks,
> Valentyn
>
> [1] https://beam.apache.org/contribute/release-guide/#review-cherry-picks
>
>
>
> On Wed, Jul 15, 2020 at 5:41 PM Chamikara Jayalath 
> wrote:
>
>> I agree. I think Dataflow x-lang users could run into flaky pipelines due
>> to this. Valentyn, are you OK with creating a new RC that includes the fix
>> (already merged - https://github.com/apache/beam/pull/12164) and
>> preferably https://github.com/apache/beam/pull/12196 ?
>>
>> Thanks,
>> Cham
>>
>> On Wed, Jul 15, 2020 at 5:27 PM Heejong Lee  wrote:
>>
>>> I think we need to cherry-pick
>>> https://issues.apache.org/jira/browse/BEAM-10397 which fixes missing
>>> environment errors for Dataflow xlang pipelines. Internally, we have a
>>> flaky xlang kafkaio test because of missing environment errors and any
>>> xlang pipelines using GroupByKey could encounter this.
>>>
>>> On Wed, Jul 15, 2020 at 5:08 PM Ahmet Altay  wrote:
>>>


 On Wed, Jul 15, 2020 at 4:55 PM Robert Bradshaw 
 wrote:

> All the artifacts, signatures, and hashes look good.
>
> I would like to understand the severity of
> https://issues.apache.org/jira/browse/BEAM-10397 before giving my
> vote.
>

 +Heejong Lee  to comment on this.


>
> On Wed, Jul 15, 2020 at 10:51 AM Pablo Estrada 
> wrote:
> >
> > +1
> > I was able to run the python 3.8 quickstart from wheels on
> DirectRunner.
> > I verified hashes for Python files.
> > -P.
> >
> > On Fri, Jul 10, 2020 at 4:34 PM Ahmet Altay 
> wrote:
> >>
> >> I validated the python 3 quickstarts. I had issues with running
> with python 3.8 wheel files, but did not have issues with source
> distributions, or other python wheel files. I have not tested python 2
> quickstarts.
>

 Did someone validate python 3.8 wheels on Dataflow? I was not able to
 run that.


> >>
> >> On Thu, Jul 9, 2020 at 10:53 PM Valentyn Tymofieiev <
> valen...@google.com> wrote:
> >>>
> >>> Hi everyone,
> >>>
> >>> Please review and vote on the release candidate #1 for the version
> 2.23.0, as follows:
> >>> [ ] +1, Approve the release
> >>> [ ] -1, Do not approve the release (please provide specific
> comments)
> >>>
> >>>
> >>> The complete staging area is available for your review, which
> includes:
> >>> * JIRA release notes [1],
> >>> * the official Apache source release to be deployed to
> dist.apache.org [2], which is signed with the key with fingerprint
> 1DF50603225D29A4 [3],
> >>> * all artifacts to be deployed to the Maven Central Repository [4],
> >>> * source code tag "v2.23.0-RC1" [5],
> >>> * 

Re: Streaming pipeline "most-recent" join

2020-07-09 Thread Reza Rokni
Hya,

I never got a chance to finish this one, maybe I will get some time in the
summer break... but I think it will help with your use case...

https://github.com/rezarokni/beam/blob/BEAM-7386/sdks/java/extensions/timeseries/src/main/java/org/apache/beam/sdk/extensions/timeseries/joins/BiTemporalStreams.java

Cheers
Reza

On Fri, Jul 10, 2020 at 8:58 AM Harrison Green 
wrote:

> Hi Beam devs,
>
> I'm working on a streaming pipeline where we need to do a "most-recent"
> join between two PCollections. Specifically, something like:
>
> out = pcoll1 | beam.Map(lambda a,b: (a,b),
> b=beam.pvalue.AsSingleton(pcoll2))
>
> The goal is to join each value in pcoll1 with only the most recent value
> from pcoll2. (in this case pcoll2 is much more sparse than pcoll1)
> ---
> altay@ suggested using a global window for the side-input pcollection
> with a trigger on each element. I've been trying to simulate this behavior
> locally with beam.testing.TestStream but I've been running into some issues.
>
> Specifically, the Repeatedly(AfterCount(1)) trigger seems to work
> correctly, but the side input receives too many panes (even when using
> discarding accumulation). I've set up a minimal demo here:
> https://colab.research.google.com/drive/1K0EqcKWxa4UK3SrkLBeHs7HSynw_VfSZ?usp=sharing
> In this example, I'm trying to join values from pcollection "a" with
> pcollection "b". However each pane of pcollection "a" is able to "see" all
> of the panes from pcollection "b" which is not what I would expect.
>
> I am curious if anyone has advice for how to handle this type of problem
> or an alternative solution for the "most-recent" join. (side note: I was
> able to hack together an alternative solution that uses a custom
> window/windowing strategy but it was fairly complex and I think a strategy
> that uses GlobalWindows would be preferred).
>
> Sincerely,
> Harrison
>
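The "most-recent" join semantics described above can be modeled outside Beam in plain Python: sort the sparse side collection by timestamp and, for each main element, look up the latest side value at or before that element's timestamp. This is only a sketch of the intended semantics (names and shapes are illustrative, not Beam APIs):

```python
import bisect

def most_recent_join(main, side):
    """Join each (timestamp, value) in `main` with the latest value from
    `side` whose timestamp is <= the main element's timestamp.

    Toy model of the 'most-recent' (bi-temporal) join discussed in this
    thread; a real Beam solution would use windows/triggers or state."""
    side = sorted(side)                          # sort side input by timestamp
    side_ts = [ts for ts, _ in side]
    out = []
    for ts, value in sorted(main):
        i = bisect.bisect_right(side_ts, ts) - 1  # latest side ts <= ts
        out.append((value, side[i][1] if i >= 0 else None))
    return out
```

For example, with `side = [(0, 'X'), (4, 'Y')]`, main elements at timestamps 1, 5, and 9 pair with 'X', 'Y', and 'Y' respectively.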


Re: [ANNOUNCE] New committer: Aizhamal Nurmamat kyzy

2020-06-30 Thread Reza Rokni
Congratulations !

On Tue, Jun 30, 2020 at 2:46 PM Michał Walenia 
wrote:

> Congratulations, Aizhamal! :)
>
> On Tue, Jun 30, 2020 at 8:41 AM Tobiasz Kędzierski <
> tobiasz.kedzier...@polidea.com> wrote:
>
>> Congratulations Aizhamal! :)
>>
>> On Mon, Jun 29, 2020 at 11:50 PM Austin Bennett <
>> whatwouldausti...@gmail.com> wrote:
>>
>>> Congratulations, @Aizhamal Nurmamat kyzy  !
>>>
>>> On Mon, Jun 29, 2020 at 2:32 PM Valentyn Tymofieiev 
>>> wrote:
>>>
 Congratulations and big thank you for all the hard work on Beam,
 Aizhamal!

 On Mon, Jun 29, 2020 at 9:56 AM Kenneth Knowles 
 wrote:

> Please join me and the rest of the Beam PMC in welcoming a new
> committer: Aizhamal Nurmamat kyzy
>
> Over the last 15 months or so, Aizhamal has driven many efforts in the
> Beam community and contributed to others. Aizhamal started by helping with
> the Beam newsletter [1] then continued by contributing to meetup planning
> [2] [3] and Beam Summit planning [4]. Aizhamal created Beam's system for
> managing social media [5] and contributed many tweets, coordinated the 
> vote
> and design of Beam's mascot [6] [7], drove migration of Beam's site to a
> more i18n-friendly infrastructure [8], kept on top of Beam's enrollment in
> Season of Docs [9], and even organized remote Beam Webinars during the
> pandemic [10].
>
> In consideration of Aizhamal's contributions, the Beam PMC trusts her
> with
> the responsibilities of a Beam committer [11].
>
> Thank you, Aizhamal, for your contributions and looking forward to
> many more!
>
> Kenn, on behalf of the Apache Beam PMC
>
> [1]
> https://lists.apache.org/thread.html/447ae9fdf580ad88522aabc8a0f3703c51acd8885578bb422389a4b0%40%3Cdev.beam.apache.org%3E
> [2]
>
> https://lists.apache.org/thread.html/ebeeae53a64dca8bb491e26b8254d247226e6d770e33dbc9428202df%40%3Cdev.beam.apache.org%3E
> [3]
>
> https://lists.apache.org/thread.html/rc31d3d57b39e6cf12ea3b6da0e884f198f8cbef9a73f6a50199e0e13%40%3Cdev.beam.apache.org%3E
> [4]
>
> https://lists.apache.org/thread.html/99815d5cd047e302b0ef4b918f2f6db091b8edcf430fb62e4eeb1060%40%3Cdev.beam.apache.org%3E
> [5]
> https://lists.apache.org/thread.html/babceeb52624fd4dd129c259db8ee9017cb68cba069b68fca7480c41%40%3Cdev.beam.apache.org%3E
> [6]
>
> https://lists.apache.org/thread.html/60aa4b149136e6aa4643749731f4b5a041ae4952e7b7e57654888bed%40%3Cdev.beam.apache.org%3E
> [7]
>
> https://lists.apache.org/thread.html/r872ba2860319cbb5ca20de953c43ed7d750155ca805cfce3b70085b0%40%3Cdev.beam.apache.org%3E
> [8]
>
> https://lists.apache.org/thread.html/rfab4cc1411318c3f4667bee051df68f37be11846ada877f3576c41a9%40%3Cdev.beam.apache.org%3E
> [9]
>
> https://lists.apache.org/thread.html/r4df2e596751e263a83300818776fbb57cb1e84171c474a9fd016ec10%40%3Cdev.beam.apache.org%3E
> [10]
>
> https://lists.apache.org/thread.html/r81b93d700fedf3012b9f02f56b5d693ac4c1aac1568edf9e0767b15f%40%3Cuser.beam.apache.org%3E
> [11]
>
> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>

>
> --
>
> Michał Walenia
> Polidea  | Software Engineer
>
> M: +48 791 432 002 <+48791432002>
> E: michal.wale...@polidea.com
>
> Unique Tech
> Check out our projects! 
>


Re: Request to be Added as Contributor

2020-06-25 Thread Reza Rokni
Welcome !

On Thu, 25 Jun 2020, 20:08 Alexey Romanenko, 
wrote:

> Hi,
>
> I added you to Contributors list.
>
> Welcome!
>
> > On 25 Jun 2020, at 13:36, Abhishek Yadav  wrote:
> >
> > Hi!
> >
> > I am Abhishek Yadav and I am working as a SWE intern at Google this
> summer.
> > I would like to be added as a contributor in the Jira.
> >
> > My ASF Jira username is abhiy.
> >
> > Thanks,
> >
> > Abhishek
>
>


Re: Jira Board to track most Beam's tickets

2020-06-17 Thread Reza Rokni
Very nice!

On Thu, 18 Jun 2020, 05:27 Pablo Estrada,  wrote:

> I was not aware that you could track by number of watchers. That seems
> like a great option.
> It looks great Gris, thanks.
> Best
> -P.
>
> On Wed, Jun 17, 2020 at 10:44 AM Tyson Hamilton 
> wrote:
>
>> This is very cool! I love the bubble visualization. Is there a way to
>> have something similar for resolved issues?
>>
>> I'd also be interested in seeing other breakdowns of open/resolved
>> issues, like burndowns for example.
>>
>> On Tue, Jun 16, 2020 at 3:34 PM Griselda Cuevas  wrote:
>>
>>> Hi folks,
>>>
>>> I made this Jira board [1] to help us have visibility into our project's
>>> tickets.
>>>
>>> Right now, the first view shows a filter of the most watched tickets
>>> with key metrics, like the number of watchers, votes, date it was created,
>>> time of last comment and the issue description. The bubble view offers a
>>> correlation between the most watched tickets, the time they've been open
>>> and the number of interactions on it, red means high correlation and a
>>> possible interpretation is that these tickets have a high level of
>>> interest/urgency.
>>>
>>> One of the uses I want this board to have is to help us triage issues we
>>> haven't triaged yet, so I will add that next.
>>>
>>> Do you have any suggestions into what view we should add?
>>> Any questions you'd like to see answers for in the Dashboard?
>>>
>>> Thanks!
>>> G
>>>
>>> [1]
>>> https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12335849
>>>
>>


Re: [ANNOUNCE] New PMC Member: Alexey Romanenko

2020-06-17 Thread Reza Rokni
Congratulations!

On Wed, Jun 17, 2020 at 2:48 PM Michał Walenia 
wrote:

> Congratulations!
>
> On Tue, Jun 16, 2020 at 11:45 PM Rui Wang  wrote:
>
>> Congrats!
>>
>>
>> -Rui
>>
>> On Tue, Jun 16, 2020 at 2:42 PM Ankur Goenka  wrote:
>>
>>> Congratulations Alexey!
>>>
>>> On Tue, Jun 16, 2020 at 2:41 PM Thomas Weise  wrote:
>>>
 Congratulations!


 On Tue, Jun 16, 2020 at 1:27 PM Valentyn Tymofieiev <
 valen...@google.com> wrote:

> Congratulations!
>
> On Tue, Jun 16, 2020 at 11:41 AM Ahmet Altay  wrote:
>
>> Congratulations!
>>
>> On Tue, Jun 16, 2020 at 10:05 AM Pablo Estrada 
>> wrote:
>>
>>> Yooohooo! Thanks for all your contributions and hard work Alexey!:)
>>>
>>> On Tue, Jun 16, 2020, 8:57 AM Ismaël Mejía 
>>> wrote:
>>>
 Please join me and the rest of Beam PMC in welcoming Alexey
 Romanenko as our
 newest PMC member.

 Alexey has significantly contributed to the project in different
 ways: new
 features and improvements in the Spark runner(s) as well as
 maintenance of
 multiple IO connectors including some of our most used ones (Kafka
 and
 Kinesis/Aws). Alexey is also quite active helping new contributors
 and our user
 community in the mailing lists / slack and Stack overflow.

 Congratulations Alexey!  And thanks for being a part of Beam!

 Ismaël

>>>
>
> --
>
> Michał Walenia
> Polidea  | Software Engineer
>
> M: +48 791 432 002 <+48791432002>
> E: michal.wale...@polidea.com
>
> Unique Tech
> Check out our projects! 
>


Re: [ANNOUNCE] New committer: Robin Qiu

2020-05-18 Thread Reza Rokni
Congratulations!

On Tue, May 19, 2020 at 10:06 AM Ahmet Altay  wrote:

> Hi everyone,
>
> Please join me and the rest of the Beam PMC in welcoming a new committer:
> Robin Qiu .
>
> Robin has been active in the community for close to 2 years, worked
> on HyperLogLog++ [1], SQL [2], improved documentation, and helped with
> releases(*).
>
> In consideration of his contributions, the Beam PMC trusts him with the
> responsibilities of a Beam committer [3].
>
> Thank you for your contributions Robin!
>
> -Ahmet, on behalf of the Apache Beam PMC
>
> [1] https://www.meetup.com/Zurich-Apache-Beam-Meetup/events/265529665/
> [2] https://www.meetup.com/Belgium-Apache-Beam-Meetup/events/264933301/
> [3] https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-
> committer
> (*) And maybe he will be a release manager soon :)
>
>


Re: [VOTE] Accept the Firefly design donation as Beam Mascot - Deadline Mon April 6

2020-04-06 Thread Reza Rokni
+1(non-binding)

On Mon, Apr 6, 2020 at 5:24 PM Alexey Romanenko 
wrote:

> +1 (non-binding).
>
> > On 3 Apr 2020, at 14:53, Maximilian Michels  wrote:
> >
> > +1 (binding)
> >
> > On 03.04.20 10:33, Jan Lukavský wrote:
> >> +1 (non-binding).
> >>
> >> On 4/2/20 9:24 PM, Austin Bennett wrote:
> >>> +1 (nonbinding)
> >>>
> >>> On Thu, Apr 2, 2020 at 12:10 PM Luke Cwik  >>> > wrote:
> >>>
> >>>+1 (binding)
> >>>
> >>>On Thu, Apr 2, 2020 at 11:54 AM Pablo Estrada  >>>> wrote:
> >>>
> >>>+1! (binding)
> >>>
> >>>On Thu, Apr 2, 2020 at 11:19 AM Alex Van Boxel
> >>>mailto:a...@vanboxel.be>> wrote:
> >>>
> >>>Thanks for clearing this up Aizhamal.
> >>>
> >>>+1 (non binding)
> >>>
> >>>_/
> >>>_/ Alex Van Boxel
> >>>
> >>>
> >>>On Thu, Apr 2, 2020 at 8:14 PM Aizhamal Nurmamat kyzy
> >>>mailto:aizha...@apache.org>> wrote:
> >>>
> >>>Good point, Alex. Actually Julian and I have talked
> >>>about producing this kind of guide. It will be
> >>>delivered as an additional contribution in the follow
> >>>up. We think this will be a derivative of the original
> >>>design, and be done after the original is officially
> >>>accepted.
> >>>
> >>>With this vote, we want to accept the Firefly donation
> >>>as designed [1], and let Julian produce other
> >>>artifacts using the official Beam mascot later on.
> >>>
> >>>[1]
> https://docs.google.com/document/d/1zK8Cm8lwZ3ALVFpD1aY7TLCVNwlyTS3PXxTV2qQCAbk/edit?usp=sharing
> >>>
> >>>
> >>>On Thu, Apr 2, 2020 at 10:37 AM Alex Van Boxel
> >>>mailto:a...@vanboxel.be>> wrote:
> >>>
> >>>I don't want to be a spoiler... but this vote
> >>>feels like a final deliverable... but without a
> >>>style guide as Kenn originally suggested most of
> >>>use will not be able to adapt the design. This
> >>>would include:
> >>>
> >>>  * frontal view
> >>>  * side view
> >>>  * back view
> >>>
> >>>actually different posses so we can mix and match.
> >>>Without this it will never reach the potential of
> >>>the Go gopher or gRPC Pancakes.
> >>>
> >>>Note this is *not* a negative vote but I'm afraid
> >>>that the use without a guide will be fairly
> >>>limited as most of use are not designers. Just a
> >>>concern.
> >>>
> >>> _/
> >>>_/ Alex Van Boxel
> >>>
> >>>
> >>>On Thu, Apr 2, 2020 at 7:27 PM Andrew Pilloud
> >>>mailto:apill...@apache.org>>
> >>>wrote:
> >>>
> >>>+1, Accept the donation of the Firefly design
> >>>as Beam Mascot
> >>>
> >>>On Thu, Apr 2, 2020 at 10:19 AM Julian Bruno
> >>> >>>> wrote:
> >>>
> >>>Hello Apache Beam Community,
> >>>
> >>>Please vote on the acceptance of the final
> >>>design of the Firefly as Beam's mascot
> >>>[1]. Please share your input no later than
> >>>Monday, April 6, at noon Pacific Time.
> >>>
> >>>
> >>>[ ] +1, Accept the donation of the Firefly
> >>>design as Beam Mascot
> >>>
> >>>[ ] -1, Decline the donation of the
> >>>Firefly design as Beam Mascot
> >>>
> >>>
> >>>Vote is adopted by at least 3 PMC +1
> >>>approval votes, with no PMC -1 disapproval
> >>>
> >>>votes. Non-PMC votes are still encouraged.
> >>>
> >>>PMC voters, please help by indicating your
> >>>vote as "(binding)"
> >>>
> >>>
> >>>The vote and input phase will be open
> >>>until Monday, April 6, at 12 pm Pacific
> Time.
> >>>
> >>>
> >>>Thank you very much for your feedback and
> >>>ideas,
> >>>
> >>>Julian
> >>>
> >>>
> >>>[1]
> >>>
> https://docs.google.com/document/d/1zK8Cm8lwZ3ALVFpD1aY7TLCVNwlyTS3PXxTV2qQCAbk/edit?usp=sharing
>
> >>>
> >>>--
> >>>Julian Bruno // Visual Artist & Graphic
> >>>

Re: [Interactive Runner] now available on master

2020-03-18 Thread Reza Rokni
Awesome !

On Thu, 19 Mar 2020, 05:38 Sam Rohde,  wrote:

> Hi All!
>
>
>
> I am happy to announce that an improved Interactive Runner is now
> available on master. This Python runner allows for the interactive
> development of Beam pipelines in a notebook (and IPython) environment.
>
>
>
> The runner still has some bugs that need to be fixed as well as some
> refactoring, but it is in a good enough shape to start using it.
>
>
>
> Here are the new things you can do with the Interactive Runner:
>
>-
>
>Create and execute pipelines within a REPL
>-
>
>Visualize elements as the pipeline is running
>-
>
>Materialize PCollections to DataFrames
>-
>
>Record unbounded sources for deterministic replay
>-
>
>Replay cached unbounded sources including watermark advancements
>
> The code lives in sdks/python/apache_beam/runners/interactive
> 
> and example notebooks are in
> sdks/python/apache_beam/runners/interactive/examples
> 
> .
>
>
>
> To install, use `pip install -e .[interactive]` in your <beam root>/sdks/python directory.
>
> To run, here’s a quick example:
>
> ```
>
> import apache_beam as beam
>
> from apache_beam.runners.interactive.interactive_runner import
> InteractiveRunner
>
> import apache_beam.runners.interactive.interactive_beam as ib
>
>
>
> p = beam.Pipeline(InteractiveRunner())
>
> words = p | beam.Create(['To', 'be', 'or', 'not', 'to', 'be'])
>
> counts = words | 'count' >> beam.combiners.Count.PerElement()
>
>
>
> # Shows a dynamically updating display of the PCollection elements
>
> ib.show(counts)
>
>
>
> # We can now visualize the data using standard pandas operations.
>
> df = ib.collect(counts)
>
> print(df.info())
>
> print(df.describe())
>
>
>
> # Plot the top-10 counted words
>
> df = df.sort_values(by=1, ascending=False)
>
> df.head(n=10).plot(x=0, y=1)
>
> ```
>
>
>
> Currently, Batch is supported on any runner. Streaming is only supported
> on the DirectRunner (non-FnAPI).
>
>
>
> I would like to thank the great work of Sindy (@sindyli) and Harsh
> (@ananvay) for the initial implementation,
>
> David Yan (@davidyan) who led the project, Ning (@ningk) and myself
> (@srohde) for the implementation and design, and Ahmet (@altay), Daniel
> (@millsd), Pablo (@pabloem), and Robert (@robertwb) who all contributed a
> lot of their time to help with the design and code reviews.
>
>
>
> It was a team effort and we wouldn't have been able to complete it without
> the help of everyone involved.
>
>
>
> Regards,
>
> Sam
>
>


Re: Hello Beam Community!

2020-03-13 Thread Reza Rokni
Welcome!

On Sat, 14 Mar 2020, 01:27 Tomo Suzuki,  wrote:

> Welcome.
>
> On Fri, Mar 13, 2020 at 1:20 PM Udi Meiri  wrote:
>
>> Welcome!
>>
>>
>> On Fri, Mar 13, 2020 at 9:47 AM Yichi Zhang  wrote:
>>
>>> Welcome!
>>>
>>> On Fri, Mar 13, 2020 at 9:40 AM Ahmet Altay  wrote:
>>>
 Welcome Brittany!

 On Thu, Mar 12, 2020 at 6:32 PM Brittany Hermann 
 wrote:

> Hello Beam Community!
>
> My name is Brittany Hermann and I recently joined the Open Source team
> in Data Analytics at Google. As a Program Manager, I will be focusing on
> community engagement while getting to work on Apache Beam and Airflow
> projects! I have always thrived on creating healthy, diverse, and overall
> happy communities and am excited to bring that to the team. For a fun 
> fact,
> I am a big Wisconsin Badgers Football fan and have a goldendoodle puppy
> named Ollie!
>
> I look forward to collaborating with you all!
>
> Kind regards,
>
> Brittany Hermann
>
>
>
>
> --
> Regards,
> Tomo
>


Re: GroupIntoBatches not Working properly for Direct Runner Java

2020-03-07 Thread Reza Rokni
Hi,

There are some known issues with the DirectRunner PubSubIO, which are not
present with Dataflow Runner. One of them is around watermarks:

https://issues.apache.org/jira/browse/BEAM-7322

Not sure if this is part of the issue here, but worth exploring...

When testing, are you sending a small volume of information and then
stopping, or are you sending continuous output?

Cheers

Reza

On Sat, Mar 7, 2020 at 1:29 PM Kenneth Knowles  wrote:

> Can you reproduce it if you replace your Pubsub source with a TestStream
> and verify with PAssert [1]? This would enable you to easily build a unit
> test. You could even open a pull request adding that to the test suite for
> GroupIntoBatches [2]. That would be an excellent contribution to Beam.
>
> Kenn
>
> [1] https://beam.apache.org/blog/2016/10/20/test-stream.html
> [2]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/GroupIntoBatchesTest.java
>
> On Mon, Mar 2, 2020 at 9:25 AM Vasu Gupta  wrote:
>
>> Input : a-1, Timestamp : 1582994620366
>> Input : c-2, Timestamp : 1582994620367
>> Input : e-3, Timestamp : 1582994620367
>> Input : d-4, Timestamp : 1582994620367
>> Input : e-5, Timestamp : 1582994620367
>> Input : b-6, Timestamp : 1582994620368
>> Input : a-7, Timestamp : 1582994620368
>>
>> Output : Timestamp : 1582994620367, Key : e-3,5
>> Output : Timestamp : 1582994620368, Key : a-1,7
>>
>> As you can see c-2 and d-4 are missing and I never received these packets.
>>
>> On 2020/02/28 18:15:03, Kenneth Knowles  wrote:
>> > What are the timestamps on the elements?
>> >
>> > On Fri, Feb 28, 2020 at 8:36 AM Vasu Gupta 
>> wrote:
>> >
>> > > Edit: Issue is on Direct Runner(Not Direction Runner - mistyped)
>> > > Issue Details:
>> > > Input data: 7 key-value Packets like: a-1, a-4, b-3, c-5, d-1, e-4,
>> e-5
>> > > Batch Size: 5
>> > > Expected output: a-1,4, b-3, c-5, d-1, e-4,5
>> > > Getting Packets with irregular size like a-1, b-5, e-4,5 OR a-1,4,
>> c-5 etc
>> > > But i always got correct number of packets with BATCH_SIZE = 1
>> > >
>> > > On 2020/02/27 20:40:16, Kenneth Knowles  wrote:
>> > > > Can you share some more details? What is the expected output and
>> what
>> > > > output are you seeing?
>> > > >
>> > > > On Thu, Feb 27, 2020 at 9:39 AM Vasu Gupta > >
>> > > wrote:
>> > > >
>> > > > > Hey folks, I am using Apache beam Framework in Java with Direction
>> > > Runner
>> > > > > for local testing purposes. When using GroupIntoBatches with batch
>> > > size 1
>> > > > > it works perfectly fine i.e. the output of the transform is
>> consistent
>> > > and
>> > > > > as expected. But when using with batch size > 1 the output
>> Pcollection
>> > > has
>> > > > > less data than it should be.
>> > > > >
>> > > > > Pipeline flow:
>> > > > > 1. A Transform for reading from pubsub
>> > > > > 2. Transform for making a KV out of the data
>> > > > > 3. A Fixed Window transform of 1 second
>> > > > > 4. Applying GroupIntoBatches transform
>> > > > > 5. And last, Logging the resulting Iterables.
>> > > > >
>> > > > > Weird thing is that it batch_size > 1 works great when running on
>> > > > > DataflowRunner but not with DirectRunner. I think the issue might
>> be
>> > > with
>> > > > > Timer Expiry since GroupIntoBatches uses BagState internally.
>> > > > >
>> > > > > Any help will be much appreciated.
>> > > > >
>> > > >
>> > >
>> >
>>
>
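The per-key batching semantics expected in this thread can be modeled in plain Python. This is a toy model of what GroupIntoBatches should produce — not the actual Beam transform, which buffers in BagState and relies on timer expiry to flush partial batches (the suspected source of the DirectRunner discrepancy):

```python
from collections import defaultdict

def group_into_batches(kvs, batch_size):
    """Batch (key, value) pairs per key, emitting a batch whenever a key's
    buffer reaches batch_size; leftover partial batches are flushed at the
    end (modeling window expiry / timer firing)."""
    buffers = defaultdict(list)
    out = []
    for key, value in kvs:
        buffers[key].append(value)
        if len(buffers[key]) == batch_size:
            out.append((key, buffers[key]))
            buffers[key] = []
    for key, values in buffers.items():  # flush partial batches
        if values:
            out.append((key, values))
    return out
```

With the thread's input (a-1, a-4, b-3, c-5, d-1, e-4, e-5) and batch size 5, no key ever fills a batch, so everything is emitted by the final flush: a-[1,4], b-[3], c-[5], d-[1], e-[4,5] — matching the expected output above. With batch size 1, every element is emitted immediately, which matches the observation that BATCH_SIZE = 1 always produced the correct number of packets.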


Re: [ANNOUNCE] New Committer: Kamil Wasilewski

2020-02-29 Thread Reza Rokni
Congratulations Kamil

On Sat, 29 Feb 2020, 06:18 Udi Meiri,  wrote:

> Welcome Kamil!
>
> On Fri, Feb 28, 2020 at 12:53 PM Mark Liu  wrote:
>
>> Congrats, Kamil!
>>
>> On Fri, Feb 28, 2020 at 12:23 PM Ismaël Mejía  wrote:
>>
>>> Congratulations Kamil!
>>>
>>> On Fri, Feb 28, 2020 at 7:09 PM Yichi Zhang  wrote:
>>>
 Congrats, Kamil!

 On Fri, Feb 28, 2020 at 9:53 AM Valentyn Tymofieiev <
 valen...@google.com> wrote:

> Congratulations, Kamil!
>
> On Fri, Feb 28, 2020 at 9:34 AM Pablo Estrada 
> wrote:
>
>> Hi everyone,
>>
>> Please join me and the rest of the Beam PMC in welcoming a new
>> committer: Kamil Wasilewski
>>
>> Kamil has contributed to Beam in many ways, including the performance
>> testing infrastructure, and a custom BQ source, along with other
>> contributions.
>>
>> In consideration of his contributions, the Beam PMC trusts him with
>> the responsibilities of a Beam committer[1].
>>
>> Thanks for your contributions Kamil!
>>
>> Pablo, on behalf of the Apache Beam PMC.
>>
>> [1] https://beam.apache.org/contribute/become-a-committer
>> /#an-apache-beam-committer
>>
>>


Re: [ANNOUNCE] New committer: Chad Dombrova

2020-02-24 Thread Reza Rokni
Congratulations! :-)

On Tue, Feb 25, 2020 at 6:41 AM Chad Dombrova  wrote:

> Thanks, folks!  I'm very excited to "retest this" :)
>
> Especially big thanks to Robert and Udi for all their hard work reviewing
> my PRs.
>
> -chad
>
>
> On Mon, Feb 24, 2020 at 1:44 PM Brian Hulette  wrote:
>
>> Congratulations Chad! Thanks for all your contributions :)
>>
>> On Mon, Feb 24, 2020 at 1:43 PM Kyle Weaver  wrote:
>>
>>> Well-deserved, thanks for your dedication to the project Chad. :)
>>>
>>> On Mon, Feb 24, 2020 at 1:34 PM Udi Meiri  wrote:
>>>
 Congrats and welcome, Chad!

 On Mon, Feb 24, 2020 at 1:21 PM Pablo Estrada 
 wrote:

> Hi everyone,
>
> Please join me and the rest of the Beam PMC in welcoming a new
> committer: Chad Dombrova
>
> Chad has contributed to the project in multiple ways, including
> improvements to the testing infrastructure, and adding type annotations
> throughout the Python SDK, as well as working closely with the community 
> on
> these improvements.
>
> In consideration of his contributions, the Beam PMC trusts him with
> the responsibilities of a Beam Committer[1].
>
> Thanks Chad for your contributions!
>
> -Pablo, on behalf of the Apache Beam PMC.
>
> [1]
> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>
>



Re: [ANNOUNCE] New committer: Jincheng Sun

2020-02-24 Thread Reza Rokni
Congrats!

On Mon, Feb 24, 2020 at 7:15 PM Jan Lukavský  wrote:

> Congrats Jincheng!
>
>   Jan
>
> On 2/24/20 11:55 AM, Maximilian Michels wrote:
> > Hi everyone,
> >
> > Please join me and the rest of the Beam PMC in welcoming a new
> > committer: Jincheng Sun 
> >
> > Jincheng has worked on generalizing parts of Beam for Flink's Python
> > API. He has also picked up other issues, like fixing documentation,
> > implementing missing features, or cleaning up code [1].
> >
> > In consideration of his contributions, the Beam PMC trusts him with
> > the responsibilities of a Beam committer [2].
> >
> > Thank you for your contributions Jincheng!
> >
> > -Max, on behalf of the Apache Beam PMC
> >
> > [1]
> >
> https://jira.apache.org/jira/browse/BEAM-9299?jql=project%20%3D%20BEAM%20AND%20assignee%20in%20(sunjincheng121)
> > [2]
> >
> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>


Re: [VOTE][BIP-1] Beam Schema Options

2020-02-20 Thread Reza Rokni
+1 (non-binding) actually working on some pipelines where this would be
super useful to have now :-)

On Thu, Feb 20, 2020 at 10:09 PM Robert Burke  wrote:

> +1 (non-binding) it feels like a reasonable extension, and portability is
> considered. I see no issues with eventually supporting it in the Go SDK
>
> On Wed, Feb 19, 2020, 11:36 PM Alex Van Boxel  wrote:
>
>> Hi all,
>>
>> let's do a vote on the very first Beam Improvement Proposal. If you have
>> a -1 or -1 (binding) please add your concern to the open issues section to
>> the wiki. Thanks.
>>
>> This is the proposal:
>>
>> https://cwiki.apache.org/confluence/display/BEAM/%5BBIP-1%5D+Beam+Schema+Options
>>
>> Can I have your votes.
>>
>>  _/
>> _/ Alex Van Boxel
>>
>


Re: [ANNOUNCE] New committer: Alex Van Boxel

2020-02-19 Thread Reza Rokni
Fantastic news! Congratulations :-)

On Wed, 19 Feb 2020 at 07:54, jincheng sun  wrote:

> Congratulations!
> Best,
> Jincheng
>
>
> On Wed, Feb 19, 2020 at 05:52, Robin Qiu wrote:
>
>> Congratulations, Alex!
>>
>> On Tue, Feb 18, 2020 at 1:48 PM Valentyn Tymofieiev 
>> wrote:
>>
>>> Congratulations!
>>>
>>> On Tue, Feb 18, 2020 at 10:38 AM Alex Van Boxel 
>>> wrote:
>>>
 Thank you everyone!

  _/
 _/ Alex Van Boxel


 On Tue, Feb 18, 2020 at 7:05 PM  wrote:

> Congrats Alex!
> Jan
>
>
> Dne 18. 2. 2020 18:46 napsal uživatel Thomas Weise :
>
> Congratulations!
>
>
> On Tue, Feb 18, 2020 at 8:33 AM Ismaël Mejía 
> wrote:
>
> Congrats Alex! Well done!
>
> On Tue, Feb 18, 2020 at 5:10 PM Gleb Kanterov 
> wrote:
>
> Congratulations!
>
> On Tue, Feb 18, 2020 at 5:02 PM Brian Hulette 
> wrote:
>
> Congratulations Alex! Well deserved!
>
> On Tue, Feb 18, 2020 at 7:49 AM Pablo Estrada 
> wrote:
>
> Hi everyone,
>
> Please join me and the rest of the Beam PMC in welcoming
> a new committer: Alex Van Boxel
>
> Alex has contributed to Beam in many ways - as an organizer for Beam
> Summit, and meetups - and also with the Protobuf extensions for schemas.
>
> In consideration of his contributions, the Beam PMC trusts him with
> the responsibilities of a Beam committer[1].
>
> Thanks for your contributions Alex!
>
> Pablo, on behalf of the Apache Beam PMC.
>
> [1]
> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>
>
> --
>
> Best,
> Jincheng
> -
> Twitter: https://twitter.com/sunjincheng121
> -
>


Re: Cleaning up Approximate Algorithms in Beam

2020-02-18 Thread Reza Rokni
Hi,

I will be making time for completing this in March, with completion and
reviews planned for April.

Cheers

Reza

On Wed, 27 Nov 2019 at 02:29, Robert Bradshaw  wrote:

> I think this thread is sufficient.
>
> On Mon, Nov 25, 2019 at 5:59 PM Reza Rokni  wrote:
>
>> Hi,
>>
>> So do we need a vote for the final list of actions? Or is this thread
>> enough to go ahead and raise the PR's?
>>
>> Cheers
>>
>> Reza
>>
>> On Tue, 26 Nov 2019 at 06:01, Ahmet Altay  wrote:
>>
>>>
>>>
>>> On Mon, Nov 18, 2019 at 10:57 AM Robert Bradshaw 
>>> wrote:
>>>
>>>> On Sun, Nov 17, 2019 at 5:16 PM Reza Rokni  wrote:
>>>>
>>>>> *Ahmet: FWIW, There is a python implementation only for this
>>>>> version: 
>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38
>>>>> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38>
>>>>>  *
>>>>> Eventually we will be able to make use of cross language transforms to
>>>>> help with feature parity. Until then, are we ok with marking this
>>>>> deprecated in python, even though we do not have another solution. Or 
>>>>> leave
>>>>> it as is in Python now, as it does not have sketch capability so can only
>>>>> be used for outputting results directly from the pipeline.
>>>>>
>>>>
>>> If it is our intention to add the capability eventually, IMO it makes
>>> sense to mark the existing functionality deprecated in Python as well.
>>>
>>>
>>>>> *Reuven: I think this is the sort of thing that has been experimental
>>>>> forever, and therefore not experimental (e.g. the entire triggering API is
>>>>> experimental as are all our file-based sinks). I think that many users use
>>>>> this, and probably store the state implicitly in streaming pipelines.*
>>>>> True, I have an old action item to try and go through and PR against
>>>>> old @experimental annotations but need to find time. So for this
>>>>> discussion; I guess this should be marked as deprecated if we change it
>>>>> even though its @experimental.
>>>>>
>>>>
>>>> Agreed.
>>>>
>>>>
>>>>> *Rob: I'm not following this--by naming things after their
>>>>> implementation rather than their intent I think they will be harder to
>>>>> search for. *
>>>>> This is to add to the name the implementation, after the intent. For
>>>>> example ApproximateCountDistinctZetaSketch, I believe should be easy to
>>>>> search for and it is clear which implementation is used. Allowing for a
>>>>> potentially better implementation ApproximateCountDistinct.
>>>>>
>>>>
>>>> OK, if we have both I'm more OK with that. This is better than the
>>>> names like HllCount, which seems to be what was suggested.
>>>>
>>>> Another approach would be to have a required  parameter which is an
>>>>> enum of the implementation options.
>>>>> ApproximateCountDistinct.of().usingImpl(ZETA) ?
>>>>>
>>>>
>>>> Ideally this could be an optional parameter, or possibly only required
>>>> during update until we figure out a good way for the runner to plug this in
>>>> appropreately.
>>>>
>>>> Rob/Kenn: On Combiner discussion, should we tie action items from the
>>>>> needs of this thread to this larger discussion?
>>>>>
>>>>> Cheers
>>>>> Reza
>>>>>
>>>>> On Fri, 15 Nov 2019 at 08:32, Robert Bradshaw 
>>>>> wrote:
>>>>>
>>>>>> On Thu, Nov 14, 2019 at 1:06 AM Kenneth Knowles 
>>>>>> wrote:
>>>>>>
>>>>>>> Wow. Nice summary, yes. Major calls to action:
>>>>>>>
>>>>>>> 0. Never allow a combiner that does not include the format of its
>>>>>>> state clear in its name/URN. The "update compatibility" problem makes 
>>>>>>> their
>>>>>>> internal accumulator state essentially part of their public API. 
>>>>>>> Combiners
>>>>>>> named for what they do are an inherent risk, since we might have a new 
>>
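The update-compatibility concern above is precisely about the accumulator format of sketches like these: the registers a combiner stores become part of its de-facto public API. As a rough illustration of what a count-distinct sketch accumulates, here is a toy HyperLogLog — emphatically not Beam's HllCount/ZetaSketch implementation, just the same idea in miniature:

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog sketch for illustration only; constants and the
    small-range correction follow the classic Flajolet et al. scheme."""

    def __init__(self, p=8):
        self.p = p                  # use 2**p registers
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        # 64-bit hash of the element
        h = int.from_bytes(
            hashlib.sha1(str(value).encode()).digest()[:8], 'big')
        idx = h >> (64 - self.p)    # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)
        # rank = 1-based position of the leftmost 1-bit in the tail
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(
            2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:   # small-range correction
            return self.m * math.log(self.m / zeros)
        return raw
```

Note that the estimate is entirely determined by the register array, so two implementations can only be update-compatible if they agree on the hash function, register layout, and correction rules — which is the argument for putting the sketch format (e.g. "ZetaSketch") in the transform's name or URN.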

Re: [ANNOUNCE] Beam 2.18.0 Released

2020-01-28 Thread Reza Rokni
Thank you Udi!

On Wed, 29 Jan 2020 at 06:34, Ahmet Altay  wrote:

> Thank you Udi!
>
> On Tue, Jan 28, 2020 at 2:13 PM kant kodali  wrote:
>
>> Looks like
>> https://beam.apache.org/documentation/runners/capability-matrix/ needs
>> to be updated? since there seems to be support for spark structured
>> streaming?
>>
>> On Tue, Jan 28, 2020 at 1:47 PM Connell O'Callaghan 
>> wrote:
>>
>>> Well done thank you Udi!!!
>>>
>>> On Tue, Jan 28, 2020 at 11:47 AM Ankur Goenka  wrote:
>>>
 Thanks Udi!

 On Tue, Jan 28, 2020 at 11:30 AM Yichi Zhang  wrote:

> Thanks Udi!
>
> On Tue, Jan 28, 2020 at 11:28 AM Hannah Jiang 
> wrote:
>
>> Thanks Udi!
>>
>>
>> On Tue, Jan 28, 2020 at 11:09 AM Pablo Estrada 
>> wrote:
>>
>>> Thanks Udi!
>>>
>>> On Tue, Jan 28, 2020 at 11:08 AM Rui Wang  wrote:
>>>
 Thank you Udi for taking care of Beam 2.18.0 release!



 -Rui

 On Tue, Jan 28, 2020 at 10:59 AM Udi Meiri  wrote:

> The Apache Beam team is pleased to announce the release of
> version 2.18.0.
>
> Apache Beam is an open source unified programming model to define
> and
> execute data processing pipelines, including ETL, batch and stream
> (continuous) processing. See https://beam.apache.org
>
> You can download the release here:
>
> https://beam.apache.org/get-started/downloads/
>
> This release includes bug fixes, features, and improvements
> detailed on
> the Beam blog:
> https://beam.apache.org/blog/2020/01/13/beam-2.18.0.html
>
> Thanks to everyone who contributed to this release, and we hope
> you enjoy
> using Beam 2.18.0.
> -- Udi Meiri, on behalf of The Apache Beam team
>


-- 

This email may be confidential and privileged. If you received this
communication by mistake, please don't forward it to anyone else, please
erase all copies and attachments, and please let me know that it has gone
to the wrong person.

The above terms reflect a potential business arrangement, are provided
solely as a basis for further discussion, and are not intended to be and do
not constitute a legally binding obligation. No legally binding obligations
will be created, implied, or inferred until an agreement in final form is
executed in writing by all parties involved.


Re: [ANNOUNCE] New committer: Hannah Jiang

2020-01-28 Thread Reza Rokni
Congratz!

On Wed, 29 Jan 2020 at 09:52, Valentyn Tymofieiev 
wrote:

> Congratulations, Hannah!
>
> On Tue, Jan 28, 2020 at 5:46 PM Udi Meiri  wrote:
>
>> Welcome and congrats Hannah!
>>
>> On Tue, Jan 28, 2020 at 4:52 PM Robin Qiu  wrote:
>>
>>> Congratulations, Hannah!
>>>
>>> On Tue, Jan 28, 2020 at 4:50 PM Alan Myrvold 
>>> wrote:
>>>
 Congrats, Hannah

 On Tue, Jan 28, 2020 at 4:46 PM Connell O'Callaghan <
 conne...@google.com> wrote:

> Thank you for sharing Luke!!!
>
> Well done and congratulations Hannah!!
>
> On Tue, Jan 28, 2020 at 4:45 PM Heejong Lee 
> wrote:
>
>> Congratulations! :)
>>
>> On Tue, Jan 28, 2020 at 4:43 PM Yichi Zhang 
>> wrote:
>>
>>> Congrats Hannah!
>>>
>>> On Tue, Jan 28, 2020 at 3:57 PM Yifan Zou 
>>> wrote:
>>>
 Congratulations Hannah!!

 On Tue, Jan 28, 2020 at 3:55 PM Boyuan Zhang 
 wrote:

> Thanks for all your contributions! Congratulations~
>
> On Tue, Jan 28, 2020 at 3:44 PM Pablo Estrada 
> wrote:
>
>> yoooho : D
>>
>> On Tue, Jan 28, 2020 at 3:21 PM Luke Cwik 
>> wrote:
>>
>>> Hi everyone,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming a new
>>> committer: Hannah Jiang
>>>
>>> Hannah has contributed to Beam in many ways, including work on
>>> building and releasing the Apache Beam SDK containers.
>>>
>>> In consideration of their contributions, the Beam PMC trusts
>>> them with the responsibilities of a Beam committer[1].
>>>
>>> Thanks for your contributions Hannah!
>>>
>>> Luke, on behalf of the Apache Beam PMC.
>>>
>>> [1]
>>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>>
>>



Re: [ANNOUNCE] New committer: Michał Walenia

2020-01-27 Thread Reza Rokni
Congratulations buddy!

On Tue, 28 Jan 2020, 06:52 Valentyn Tymofieiev,  wrote:

> Congratulations, Michał!
>
> On Mon, Jan 27, 2020 at 2:24 PM Austin Bennett <
> whatwouldausti...@gmail.com> wrote:
>
>> Nice -- keep up the good work!
>>
>> On Mon, Jan 27, 2020 at 2:02 PM Mikhail Gryzykhin 
>> wrote:
>> >
>> > Congratulations Michal!
>> >
>> > --Mikhail
>> >
>> > On Mon, Jan 27, 2020 at 1:01 PM Kyle Weaver 
>> wrote:
>> >>
>> >> Congratulations Michał! Looking forward to your future contributions :)
>> >>
>> >> Thanks,
>> >> Kyle
>> >>
>> >> On Mon, Jan 27, 2020 at 12:47 PM Pablo Estrada 
>> wrote:
>> >>>
>> >>> Hi everyone,
>> >>>
>> >>> Please join me and the rest of the Beam PMC in welcoming a new
>> committer: Michał Walenia
>> >>>
>> >>> Michał has contributed to Beam in many ways, including the
>> performance testing infrastructure, and has even spoken at events about
>> Beam.
>> >>>
>> >>> In consideration of his contributions, the Beam PMC trusts him with
>> the responsibilities of a Beam committer[1].
>> >>>
>> >>> Thanks for your contributions Michał!
>> >>>
>> >>> Pablo, on behalf of the Apache Beam PMC.
>> >>>
>> >>> [1]
>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>
>


Re: Dynamic timers now supported!

2020-01-23 Thread Reza Rokni
Very cool ! Thank you Rehman!

On Fri, 24 Jan 2020, 06:12 Maximilian Michels,  wrote:

> Great work! That makes timers so much easier to use and also adds new
> use cases. Thank you Rehman.
>
> On 23.01.20 22:54, Robert Burke wrote:
> > Fascinating and great work! *makes notes for an eventual Go SDK
> > implementation*
> >
> > On Thu, Jan 23, 2020, 1:51 PM Luke Cwik  > > wrote:
> >
> > This is great. Thanks for the contribution Rehman.
> >
> > On Thu, Jan 23, 2020 at 10:09 AM Reuven Lax  > > wrote:
> >
> > Thanks to a lot of hard work by Rehman, Beam now supports
> > dynamic timers. As a reminder, this was discussed on the dev
> > list some time back.
> >
> > As background, previously one had to statically declare all
> > timers in your code. So if you wanted to have two timers, you
> > needed to create two timer variables and two callbacks - one for
> > each timer. A number of users kept hitting stumbling blocks
> > where they needed a dynamic set of timers (often based on the
> > element), which was not supported in Beam. The workarounds were
> > quite ugly and complicated.
> >
> > The new support allows declaring a TimerMap, which is a map of
> > timers. Each TimerMap is scoped by a family name, so you can
> > create multiple TimerMaps each with its own callback. The use
> > looks as follows:
> >
> > class MyDoFn extends DoFn<...> {
> > @TimerFamily("timers")
> > private final TimerSpec timerMap =
> > TimerSpecs.timerMap(TimeDomain.EVENT_TIME);
> >
> > @ProcessElement
> >  public void process(@TimerFamily("timers") TimerMap
> > timers, @Element Type e) {
> > timers.set("mainTimer", timestamp);
> > timers.set("actionType" + e.getActionType(), timestamp);
> > }
> >
> >    @OnTimerFamily("timers")
> >public void onTimer(@TimerId String timerId) {
> >   System.out.println("Timer fired. id: " + timerId);
> >}
> > }
> >
> > This currently works for the Flink and the Dataflow runners.
> >
> > Thank you Rehman for getting this done! Beam users will find it
> > very valuable.
> >
> > Reuven
> >
>
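[Editor's note: the TimerMap semantics described above can be modeled runner-independently. The sketch below is a toy, single-key, single-family simulation in plain Python with no Beam dependency; names like `TimerMapSim` are illustrative, not part of any Beam API. It shows the two key behaviors: setting a timer with an existing id overwrites its firing time, and timers fire in timestamp order with the id passed to the callback.]

```python
class TimerMapSim:
    """Toy model of a Beam TimerMap for one key and one timer family.

    Setting a timer id that already exists overwrites its firing time;
    timers fire in timestamp order and the callback receives the id.
    """

    def __init__(self):
        self._timers = {}  # timer id -> firing timestamp

    def set(self, timer_id, timestamp):
        self._timers[timer_id] = timestamp  # last set wins

    def advance_to(self, watermark, on_timer):
        # Fire every timer at or before the watermark, in timestamp order.
        due = sorted(
            (ts, tid) for tid, ts in self._timers.items() if ts <= watermark
        )
        for ts, tid in due:
            del self._timers[tid]
            on_timer(tid, ts)


fired = []
timers = TimerMapSim()
timers.set("mainTimer", 10)
timers.set("actionTypePurchase", 5)
timers.set("mainTimer", 7)  # overwrites the earlier set at 10
timers.advance_to(8, lambda tid, ts: fired.append((tid, ts)))
print(fired)  # [('actionTypePurchase', 5), ('mainTimer', 7)]
```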


Re: [DISCUSS] Integrate Google Cloud AI functionalities

2020-01-20 Thread Reza Rokni
+1 for using cross language transforms.

On Thu, 16 Jan 2020 at 01:23, Ahmet Altay  wrote:

>
>
> On Wed, Jan 15, 2020 at 8:12 AM Kamil Wasilewski <
> kamil.wasilew...@polidea.com> wrote:
>
>> Based on your feedback, I think it'd be fine to deal with the problem as
>> follows:
>> * for Python: put the transforms into `sdks/python/apache_beam/io/gcp/ai`
>> * for Java: create a `google-cloud-platform-ai` module in
>> `sdks/java/extensions` folder
>>
>> As for cross language, we expect those transforms to be quite simple, so
>> the cost of implementing them twice is not that high.
>>
>
> One option would be to implement inference in a library like tfx_bsl [1].
> It comes with a generalized Beam transform that can do inference either
> from a saved model file or by using a service endpoint. The service
> endpoint API option is there and could support cloud AI APIs. If we utilize
> tfx_bsl, we will leverage the existing TFX integration and would avoid
> creating a parallel set of transforms. Then for Java, we could enable the
> same interface with cross language transform and offer a unified inference
> API for both languages.
>
> [1]
> https://github.com/tensorflow/tfx-bsl/blob/a9f5b6128309595570cc6212f8076e7a20063ac2/tfx_bsl/beam/run_inference.py#L78
>
>
>
>>
>> Thanks for your input,
>> Kamil
>>
>> On Wed, Jan 15, 2020 at 7:58 AM Alex Van Boxel  wrote:
>>
>>> If it's in Java also be careful to align with the current google cloud
>>> IO's, certainly it's dependencies. The google IO's are not depending on the
>>> the newest client libraries and that's something we're sometimes struggling
>>> with when we depend on our own client libraries. So make sure to align them.
>>>
>>> Also note that although gRPC is vendored, the google IO's do still have
>>> their own dependency on gRPC and this is the biggest reason for trouble.
>>>
>>>  _/
>>> _/ Alex Van Boxel
>>>
>>>
>>> On Wed, Jan 15, 2020 at 1:18 AM Luke Cwik  wrote:
>>>
 It depends on what language the client libraries are exposed in. For
 example, if the client libraries are in Java, sdks/java/extensions makes
 sense while if its Python then integrating it within the gcp extension
 within sdks/python/apache_beam makes sense.

 Adding additional dependencies is ok depending on the licensing and the
 process is slightly different for each language.

 For transforms that are complicated, there is a cross language effort
 going on so that one can execute one language's transforms within another
 languages pipeline which may remove the need to write the transforms more
 then once.

 On Tue, Jan 14, 2020 at 7:43 AM Ismaël Mejía  wrote:

> Nice idea, IO looks like a good place for them but there is another
> path that could fit this case: `sdks/java/extensions`, some module like
> `google-cloud-platform-ai` in that folder or something like that, no?
>
> In any case great initiative. +1
>
>
>
> On Tue, Jan 14, 2020 at 4:22 PM Kamil Wasilewski <
> kamil.wasilew...@polidea.com> wrote:
>
>> Hi all,
>>
>> We’d like to implement a set of PTransforms that would allow users to
>> use some of the Google Cloud AI services in Beam pipelines.
>>
>> Here's the full list of services and functionalities we’d like to
>> integrate Beam with:
>>
>> * Video Intelligence [1]
>>
>> * Cloud Natural Language [2]
>>
>> * Cloud AI Platform Prediction [3]
>>
>> * Data Masking/Tokenization [4]
>>
>> * Inspecting image data for sensitive information using Cloud Vision
>> [5]
>>
>> However, we're not sure whether to put those transforms directly into
>> Beam, because they would require some additional GCP dependencies. One of
>> our ideas is a separate library, that depends on Beam and that can be
>> installed optionally, stored somewhere in the beam repository (e.g. in 
>> the
>> BEAM_ROOT/extras directory). Do you think it is a reasonable approach? Or
>> maybe it is totally fine to put them into SDKs, just like other IOs?
>>
>> If you have any other thoughts, do not hesitate to let us know.
>>
>> Best,
>>
>> Kamil
>>
>> [1] https://cloud.google.com/video-intelligence/
>>
>> [2] https://cloud.google.com/natural-language/
>>
>> [3] https://cloud.google.com/ml-engine/docs/prediction-overview
>>
>> [4]
>> https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming#dlptexttobigquerystreaming
>>
>> [5] https://cloud.google.com/vision/
>>
>


Re: [Proposal] Slowly Changing Dimensions and Distributed Map Side Inputs (in Dataflow)

2020-01-15 Thread Reza Rokni
+1 to this proposal; this is a very common pattern requirement from users.
The following workaround has already seen a lot of traction:

https://beam.apache.org/documentation/patterns/side-inputs/#slowly-updating-global-window-side-inputs
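[Editor's note: the semantics of the linked pattern can be sketched without Beam. Below is a toy, runner-independent Python model (not Beam code; `enrich` and its shapes are illustrative): the main stream is joined against the latest snapshot of the slowly changing table whose refresh time is at or before each element's event time, mimicking a side input refreshed on a periodic trigger.]

```python
def enrich(events, snapshots):
    """Toy model of the slowly-updating side input pattern.

    `events` is an iterable of (event_time, key); `snapshots` is a list of
    (refresh_time, {key: value}) pairs produced by periodically re-reading
    the dimension table. Each event is joined against the latest snapshot
    taken at or before its event time.
    """
    snapshots = sorted(snapshots)
    out = []
    for ts, key in sorted(events):
        current = {}
        for refresh_ts, table in snapshots:
            if refresh_ts <= ts:
                current = table  # most recent refresh not after the event
            else:
                break
        out.append((ts, key, current.get(key)))
    return out


snapshots = [(0, {"u1": "bronze"}), (10, {"u1": "gold"})]
events = [(5, "u1"), (12, "u1")]
print(enrich(events, snapshots))
# [(5, 'u1', 'bronze'), (12, 'u1', 'gold')]
```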

Making this process simpler for users, and available out of the box, would be
a great win!

I would also mention that ideally we should cover large distributed side
inputs as well, but many of the core cases come down to side inputs that do
fit in memory. It may be worth prioritizing the work so that the smaller side
input tables take precedence, unless the work covers both cases in the same
way, of course.

Cheers

Reza

On Thu, 19 Dec 2019 at 07:14, Kenneth Knowles  wrote:

> I do think that the implementation concerns around larger side inputs are
> relevant to most runners. Ideally there would be no model change necessary.
> Triggers are harder and bring in consistency concerns, which are even more
> likely to be relevant to all runners.
>
> Kenn
>
> On Wed, Dec 18, 2019 at 11:23 AM Luke Cwik  wrote:
>
>> Most of the doc is about how to support distributed side inputs in
>> Dataflow and doesn't really cover how the Beam model (accumulating,
>> discarding, retraction) triggers impact what are the "contents" of a
>> PCollection in time and how this proposal for a limited set of side input
>> shapes can work to support larger side inputs in Dataflow.
>>
>> On Tue, Dec 17, 2019 at 2:28 AM Jan Lukavský  wrote:
>>
>>> Hi Mikhail,
>>> On 12/17/19 10:43 AM, Mikhail Gryzykhin wrote:
>>>
>>> inline
>>>
>>> On Tue, Dec 17, 2019 at 12:59 AM Jan Lukavský  wrote:
>>>
 Hi,

 I actually thought that the proposal refers to Dataflow only. If this
 is supposed to be general, can we remove the Dataflow/Windmill specific
 parts and replace them with generic ones?

>>>  I'll look into rephrasing doc to keep Dataflow/Windmill as example.
>>>
>>> Cool, thanks!
>>>
>>> I'd have two more questions:

  a) the proposal is named "Slowly changing", why is the rate of change
 essential to the proposal? Once running on event time, that should not
 matter, or what am I missing?

>>> Within this proposal, it is suggested to make a full snapshot of data on
>>> every re-read. This is generally expensive and setting time event to short
>>> interval might cause issues. Otherwise it is not essential.
>>>
>>> Understood. This relates to table-stream duality, where the requirements
>>> might relax once you don't have to convert table to stream by re-reading
>>> it, but by being able to retrieve updates as you go (example would be
>>> reading directly from kafka or any other "commit log" abstraction).
>>>
>>>  b) The description says: 'User wants to solve a stream enrichment
 problem. In brief request sounds like: ”I want to enrich each event in this
 stream by corresponding data from given table.”'. That is understandable,
 but would it be better to enable the user to express this intent directly
 (via Join operation)? The actual implementation might be runner (and
 input!) specific. The analogy is that when doing group-by-key operation,
 runner can choose hash grouping or sort-merge grouping, but that is not
 (directly) expressed in user code. I'm not saying that we should not have
 low-level transforms, just asking if it would be better to leave this
 decision to the runner (at least in some cases). It might be the case that
 we want to make core SDK as low level as possible (and as reasonable), I
 just want to make sure that that is really the intent.

>>> The idea is to add basic operation with as small change as possible for
>>> current API.
>>> Ultimate goal is to have a Join/GBK operator that will choose proper
>>> strategy. However, I don't think that we have proper tools and view of how
>>> to choose best strategy at hand as of yet.
>>>
>>> OK, cool. That is where I would find it very much useful to have some
>>> sort of "goals", that we are targeting. I agree that there are some pieces
>>> missing in the puzzle as of now. But it would be good to know what these
>>> pieces are and what needs to be done to fulfill our goals. But this is
>>> probably not related to discussion of this proposal, but more related to
>>> the concept of BIP or similar.
>>>
>>> Thanks for the explanation.
>>>
>>> Thanks for the proposal!

 Jan
 On 12/17/19 12:01 AM, Kenneth Knowles wrote:

 I want to highlight that this design works for definitely more runners
 than just Dataflow. I see two pieces of it that I want to bring onto the
 thread:

 1. A new kind of "unbounded source" which is a periodic refresh of a
 bounded source, and use that as a side input. Each main input element has a
 window that maps to a specific refresh of the side input.
 2. Distributed map side inputs: supporting very large lookup tables,
 but with consistency challenges. Even the part 

Unset / delete Timers

2020-01-07 Thread Reza Rokni
Hi,

I was exploring the ability to add an unset/reset option for Timer. Would
this be an expensive operation for runners to support?

More complex State and Timer use cases can require this operation, and while
it's possible to do today using a separate State object, it's heavy on
boilerplate and cumbersome.

Cheers
Reza
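[Editor's note: the state-based workaround Reza alludes to can be sketched as follows. This is a toy, single-key Python model, not Beam code, and the class and method names are hypothetical: since a set timer cannot be deleted, the intended firing time is kept in a piece of value state and any timer callback whose timestamp does not match it is ignored as stale.]

```python
class LatestTimerOnly:
    """Toy model of the 'unset a timer' workaround.

    With no way to clear a timer, we record the one firing time we still
    consider valid (standing in for a ValueState) and skip every other
    firing. Unsetting is just clearing that recorded value.
    """

    def __init__(self):
        self.valid_ts = None  # stands in for a ValueState<Long>
        self.pending = []     # timestamps of timers already set

    def set_timer(self, ts):
        self.valid_ts = ts    # only the most recent set is "valid"
        self.pending.append(ts)

    def unset_timer(self):
        self.valid_ts = None  # logical delete: all pending firings stale

    def fire_due(self, watermark, on_timer):
        for ts in sorted(t for t in self.pending if t <= watermark):
            self.pending.remove(ts)
            if ts == self.valid_ts:  # ignore stale firings
                on_timer(ts)


fired = []
t = LatestTimerOnly()
t.set_timer(5)
t.set_timer(9)  # supersedes the timer at 5
t.fire_due(10, fired.append)
print(fired)  # [9] -- the firing at 5 was skipped as stale
```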









Re: Beam Testing Tools FAQ

2019-12-03 Thread Reza Rokni
Thanx!

On Wed, 27 Nov 2019, 02:31 Pablo Estrada,  wrote:

> Very cool. Thanks Lukasz!
>
> On Tue, Nov 26, 2019 at 9:41 AM Alan Myrvold  wrote:
>
>> Nice, thanks!
>>
>> On Tue, Nov 26, 2019 at 8:04 AM Robert Bradshaw 
>> wrote:
>>
>>> Thanks!
>>>
>>> On Tue, Nov 26, 2019 at 7:43 AM Łukasz Gajowy 
>>> wrote:
>>> >
>>> > Hi all,
>>> >
>>> > our documentation (either confluence or the website docs) describes
>>> how to create various integration and performance tests - there already are
>>> core operations tests, nexmark and IO test documentation pages. However, we
>>> are lacking some general docs to describe what tools do we have and what is
>>> the purpose of them.
>>> >
>>> > Therefore, I took the liberty of creating the Beam Testing Tools FAQ
>>> on our confluence:
>>> >
>>> https://cwiki.apache.org/confluence/display/BEAM/Beam+Testing+Tools+FAQ
>>> >
>>> > Hopefully, this is helpful and sheds some more light on that important
>>> part of our infrastructure. If you feel that something is missing there,
>>> feel free to let me know or add it yourself. :)
>>> >
>>> > Thanks,
>>> > Łukasz
>>>
>>


Re: Full stream-stream join semantics

2019-11-27 Thread Reza Rokni
 work out the definition of the join you are
>>>>>> interested in, with a good amount of mathematical rigor, and then 
>>>>>> consider
>>>>>> the ways you can implement it. That is where a design doc will probably
>>>>>> clarify things.
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>>  b) until retractions are 100% functional (and that is sort of holy
>>>>>>> grail for now), then the only solution is using a buffer holding data 
>>>>>>> up to
>>>>>>> watermark *and then sort by event time*
>>>>>>>
>>>>>>  c) even if retractions were 100% functional, there would have to be
>>>>>>> special implementation for batch case, because otherwise this would 
>>>>>>> simply
>>>>>>> blow up downstream processing with insanely many false additions and
>>>>>>> subsequent retractions
>>>>>>>
>>>>>>> Property b) means that if we want this feature now, we must sort by
>>>>>>> event time and there is no way around. Property c) shows that even in 
>>>>>>> the
>>>>>>> future, we must make (in certain cases) distinction between batch and
>>>>>>> streaming code paths, which seems weird to me, but it might be an 
>>>>>>> option.
>>>>>>> But still, there is no way to express this join in batch case, because 
>>>>>>> it
>>>>>>> would require either buffering (up to) whole input on local worker 
>>>>>>> (doesn't
>>>>>>> look like viable option) or provide a way in user code to signal the 
>>>>>>> need
>>>>>>> for ordering of data inside GBK (and we are there again :)). Yes, we 
>>>>>>> might
>>>>>>> shift this need from stateful dofn to GBK like
>>>>>>>
>>>>>>>  input.apply(GroupByKey.sorted())
>>>>>>>
>>>>>>> I cannot find a good reasoning why this would be better than giving
>>>>>>> this semantics to (stateful) ParDo.
>>>>>>>
>>>>>>> Maybe someone can help me out here?
>>>>>>>
>>>>>>> Jan
>>>>>>> On 11/24/19 5:05 AM, Kenneth Knowles wrote:
>>>>>>>
>>>>>>> I don't actually see how event time sorting simplifies this case
>>>>>>> much. You still need to buffer elements until they can no longer be 
>>>>>>> matched
>>>>>>> in the join, and you still need to query that buffer for elements that
>>>>>>> might match. The general "bi-temporal join" (without sorting) requires 
>>>>>>> one
>>>>>>> new state type and then it has identical API, does not require any novel
>>>>>>> data structures or reasoning, yields better latency (no sort buffer 
>>>>>>> delay),
>>>>>>> and discards less data (no sort buffer cutoff; watermark is better).
>>>>>>> Perhaps a design document about this specific case would clarify.
>>>>>>>
>>>>>>> Kenn
>>>>>>>
>>>>>>> On Fri, Nov 22, 2019 at 10:08 PM Jan Lukavský 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I didn't want to go too much into detail, but to describe the idea
>>>>>>>> roughly (ignoring the problem of different window fns on both sides to 
>>>>>>>> keep
>>>>>>>> it as simple as possible):
>>>>>>>>
>>>>>>>> rhs -  \
>>>>>>>>
>>>>>>>> flatten (on global window)  stateful par do
>>>>>>>> (sorted by event time)   output
>>>>>>>>
>>>>>>>> lhs -  /
>>>>>>>>
>>>>>>>> If we can guarantee event time order arrival of events into the
>>>>>>>> stateful pardo, then the whole complexity reduces to keep current 
>>>>>>>> value of
>>>>>>>> left and right element and just flush them out each time there is an
>>>>>>>> update. That is the "knob" is actually w
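[Editor's note: the simplification Jan sketches above can be modeled in a few lines once event-time-ordered delivery into the stateful step is assumed. This is a toy, single-key Python model, not Beam code: state reduces to the current value of each side, and a joined pair is flushed on every update.]

```python
def sorted_stream_join(tagged_events):
    """Toy model of the event-time-sorted stream-stream join.

    `tagged_events` is a list of (event_time, side, value) with side in
    {"lhs", "rhs"}, assumed delivered in event-time order (the guarantee
    the sorted stateful ParDo would provide). Each update emits the pair
    of current values from both sides.
    """
    current = {"lhs": None, "rhs": None}
    out = []
    for ts, side, value in tagged_events:
        current[side] = value
        out.append((ts, current["lhs"], current["rhs"]))
    return out


events = [(1, "lhs", "a"), (2, "rhs", "x"), (4, "lhs", "b")]
print(sorted_stream_join(events))
# [(1, 'a', None), (2, 'a', 'x'), (4, 'b', 'x')]
```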

Re: [DISCUSS] AWS IOs V1 Deprecation Plan

2019-11-26 Thread Reza Rokni
Hi Alexey,

With regards to @Experimental, there are a couple of discussions around its
usage (or rather overuse!) on dev@. It is something that we need to clean up
(some of those IOs have now been used in production environments for years!).

Cheers

Reza

On Wed, 27 Nov 2019 at 04:50, Luke Cwik  wrote:

> I suggested the wrapper because sometimes the intent of the APIs can be
> translated easily but this is not always the case.
>
> Good to know that it is all marked @Experimental.
>
> On Tue, Nov 26, 2019 at 12:30 PM Cam Mach  wrote:
>
>> Thank you, Alex for sharing the information, and Luke for the questions.
>> I like the idea that just depreciate the V1 IOs, and just maintain V2
>> IOs, so we can support whoever want continue with V1.
>> Just as Alex said, a lot of users, including my teams :-) , use the V1
>> IOs in production for real workload. So it'll be hard to remove V1 IOs or
>> force them migrate to V2. But let hear if there are any other ideas?
>>
>> Btw, making V1 a wrapper around V2 is not very positive, code will get
>> more complicated since V2 API is very different from V1's.
>>
>> Thanks,
>>
>>
>>
>> On Tue, Nov 26, 2019 at 8:21 AM Alexey Romanenko <
>> aromanenko@gmail.com> wrote:
>>
>>> AFAICT, all AWS SDK V1 IOs (SnsIO, SqsIO, DynamoDBIO, KinesisIO) are
>>> marked as "Experimental". So, it should not be a problem to gracefully
>>> deprecate and finally remove them. We already did the similar procedure for
>>> “HadoopInputFormatIO”, which was renamed to just “HadoopFormatIO” (since it
>>> started to support HadoopOutputFormatI as well). Old “HadoopInputFormatIO”
>>> was deprecated and removed after *3 consecutive* Beam releases (as we
>>> agreed on mailing list).
>>>
>>> In the same time, some users for some reasons would not be able or to
>>> want to move on AWS SDK V2. So, I’d prefer to just deprecate AWS SDK V1 IOs
>>> and accept new features/fixes *only* for V2 IOs.
>>>
>>> Talking about “Experimental” annotation. Sorry in advance If I missed
>>> that and switch a subject a bit, but do we have clear rules or an agreement
>>> when IO becomes stable and should not be marked as experimental anymore?
>>> *Most* of our Java IOs are marked as Experimental but many of them were
>>> using in production by real users under real load. Does it mean that they
>>> are ready to be stable in terms of API? Perhaps, this topic deserves a new
>>> discussion if there are several opinions on that.
>>>
>>> On 26 Nov 2019, at 00:39, Luke Cwik  wrote:
>>>
>>> Phase I sounds fine.
>>>
>>> Apache Beam follows semantic versioning and I believe removing the IOs
>>> will be a backwards incompatible change unless they were marked
>>> experimental which will be a problem for Phase 2.
>>>
>>> What is the feasibility of making the V1 transforms wrappers around V2?
>>>
>>> On Mon, Nov 25, 2019 at 1:46 PM Cam Mach  wrote:
>>>
 Hello Beam Devs,

 I have been working on the migration of Amazon Web Services IO
 connectors into the new AWS SDK for Java V2. The goal is to have an updated
 implementation aligned with the most recent AWS improvements. So far we
 have already migrated the connectors for AWS SNS, SQS and  DynamoDB.

 In the meantime some contributions are still going on V1 IOs. So far we
 have dealt with those by porting (or asking contributors) to port the
 changes into V2 IOs too because we don’t want features of both versions to
 be unaligned but this may quickly become a maintenance issue, so we want to
 discuss a plan to stop supporting (deprecate) V1 IOs and encourage users to
 move to V2.

 Phase I (ASAP):

- Mark migrated AWS V1 IOs as deprecated
- Document migration path to V2

 Phase II (end of 2020):

- Decide a date or Beam release to remove the V1 IOs
- Send a notification to the community 3 months before we remove
them
- Completely get rid of V1 IOs


 Please let me know what you think or if you see any potential issues?

 Thanks,
 Cam Mach


>>>



Re: Cleaning up Approximate Algorithms in Beam

2019-11-25 Thread Reza Rokni
Hi,

So do we need a vote for the final list of actions? Or is this thread
enough to go ahead and raise the PRs?

Cheers

Reza

On Tue, 26 Nov 2019 at 06:01, Ahmet Altay  wrote:

>
>
> On Mon, Nov 18, 2019 at 10:57 AM Robert Bradshaw 
> wrote:
>
>> On Sun, Nov 17, 2019 at 5:16 PM Reza Rokni  wrote:
>>
>>> *Ahmet: FWIW, There is a python implementation only for this
>>> version: 
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38
>>> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38>
>>>  *
>>> Eventually we will be able to make use of cross language transforms to
>>> help with feature parity. Until then, are we ok with marking this
>>> deprecated in python, even though we do not have another solution. Or leave
>>> it as is in Python now, as it does not have sketch capability so can only
>>> be used for outputting results directly from the pipeline.
>>>
>>
> If it is our intention to add the capability eventually, IMO it makes
> sense to mark the existing functionality deprecated in Python as well.
>
>
>>> *Reuven: I think this is the sort of thing that has been experimental
>>> forever, and therefore not experimental (e.g. the entire triggering API is
>>> experimental as are all our file-based sinks). I think that many users use
>>> this, and probably store the state implicitly in streaming pipelines.*
>>> True, I have an old action item to try and go through and PR against
>>> old @experimental annotations but need to find time. So for this
>>> discussion; I guess this should be marked as deprecated if we change it
>>> even though its @experimental.
>>>
>>
>> Agreed.
>>
>>
>>> *Rob: I'm not following this--by naming things after their
>>> implementation rather than their intent I think they will be harder to
>>> search for. *
>>> This is to add to the name the implementation, after the intent. For
>>> example ApproximateCountDistinctZetaSketch, I believe should be easy to
>>> search for and it is clear which implementation is used. Allowing for a
>>> potentially better implementation ApproximateCountDistinct.
>>>
>>
>> OK, if we have both I'm more OK with that. This is better than the names
>> like HllCount, which seems to be what was suggested.
>>
>> Another approach would be to have a required  parameter which is an enum
>>> of the implementation options.
>>> ApproximateCountDistinct.of().usingImpl(ZETA) ?
>>>
>>
>> Ideally this could be an optional parameter, or possibly only required
>> during update until we figure out a good way for the runner to plug this in
>> appropreately.
>>
>> Rob/Kenn: On Combiner discussion, should we tie action items from the
>>> needs of this thread to this larger discussion?
>>>
>>> Cheers
>>> Reza
>>>
>>> On Fri, 15 Nov 2019 at 08:32, Robert Bradshaw 
>>> wrote:
>>>
>>>> On Thu, Nov 14, 2019 at 1:06 AM Kenneth Knowles 
>>>> wrote:
>>>>
>>>>> Wow. Nice summary, yes. Major calls to action:
>>>>>
>>>>> 0. Never allow a combiner that does not include the format of its
>>>>> state clear in its name/URN. The "update compatibility" problem makes 
>>>>> their
>>>>> internal accumulator state essentially part of their public API. Combiners
>>>>> named for what they do are an inherent risk, since we might have a new way
>>>>> to do the same operation with different implementation-detail state.
>>>>>
>>>>
>>>> It seems this will make for a worse user experience, motivated solely
>>>> by limitations in our implementation. I think we can do better.
>>>> Hypothetical idea: what if upgrade required access to the original graph
>>>> (or at least metadata about it) during construction? In this case an
>>>> ApproximateDistinct could look at what was used last time and try to do the
>>>> same, but be free to do something better when unconstrained. Another
>>>> approach would be to encode several alternative expansions in the Beam
>>>> graph and let the runner do the picking (based on prior submission).
>>>> (Making the CombineFn, as opposed to the composite, have several
>>>> alternatives seems harder to reason about, but maybe worth pursuing as
>>>> well).
>>>>
>>>> This is n

Re: [VOTE] Beam Mascot animal choice: vote for as many as you want

2019-11-19 Thread Reza Rokni
[ ] Beaver
[ ] Hedgehog
[ ] Lemur
[ ] Owl
[X] Salmon
[ ] Trout
[ ] Robot dinosaur
[ ] Firefly
[ ] Cuttlefish
[X] Dumbo Octopus
[X] Angler fish

On Wed, 20 Nov 2019 at 10:43, Kenneth Knowles  wrote:

> Please cast your votes of approval [1] for animals you would support as
> Beam mascot. The animal with the most approval will be identified as the
> favorite.
>
> *** Vote for as many as you like, using this checklist as a template 
>
> [ ] Beaver
> [ ] Hedgehog
> [ ] Lemur
> [ ] Owl
> [ ] Salmon
> [ ] Trout
> [ ] Robot dinosaur
> [ ] Firefly
> [ ] Cuttlefish
> [ ] Dumbo Octopus
> [ ] Angler fish
>
> This vote will remain open for at least 72 hours.
>
> Kenn
>
> [1] See https://en.wikipedia.org/wiki/Approval_voting#Description and
> https://www.electionscience.org/library/approval-voting/
>




Re: Cleaning up Approximate Algorithms in Beam

2019-11-17 Thread Reza Rokni
*Ahmet: FWIW, There is a python implementation only for this
version: 
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38
<https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38>
*
Eventually we will be able to make use of cross-language transforms to help
with feature parity. Until then, are we OK with marking this deprecated in
Python, even though we do not have another solution? Or should we leave it as
is in Python for now, since it does not have sketch capability and so can only
be used for outputting results directly from the pipeline?

*Reuven: I think this is the sort of thing that has been experimental
forever, and therefore not experimental (e.g. the entire triggering API is
experimental as are all our file-based sinks). I think that many users use
this, and probably store the state implicitly in streaming pipelines.*
True, I have an old action item to go through and raise a PR against old
@Experimental annotations, but I need to find time. So for this discussion, I
guess this should be marked as deprecated if we change it, even though it's
@Experimental.

*Rob: I'm not following this--by naming things after their implementation
rather than their intent I think they will be harder to search for. *
This is to add the implementation to the name, after the intent. For
example, ApproximateCountDistinctZetaSketch should be easy to search for,
and it is clear which implementation is used, allowing for a potentially
better implementation of ApproximateCountDistinct later. Another approach
would be to have a required parameter which is an enum of the
implementation options: ApproximateCountDistinct.of().usingImpl(ZETA)?
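A rough sketch of the enum-parameter idea follows. This is illustrative Python only — the class and method names (`Impl`, `using_impl`) are hypothetical, and the real transform would be a Java PTransform:

```python
from enum import Enum

class Impl(Enum):
    # Hypothetical mapping of enum values to backing implementations
    ZETA = "zetasketch"          # would delegate to HllCount
    CLEARSPRING = "clearspring"  # would delegate to ApproximateDistinct

class ApproximateCountDistinct:
    """Intent in the transform name; implementation as an explicit, required
    parameter, so the sketch format stays part of the public contract."""

    def __init__(self, impl=None):
        self.impl = impl

    @classmethod
    def of(cls):
        return cls()

    def using_impl(self, impl):
        return ApproximateCountDistinct(impl)

t = ApproximateCountDistinct.of().using_impl(Impl.ZETA)
print(t.impl)  # Impl.ZETA
```

The design point is that the user-facing name stays searchable while the state format remains an explicit, compatibility-relevant choice.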

Rob/Kenn: On Combiner discussion, should we tie action items from the needs
of this thread to this larger discussion?

Cheers
Reza

On Fri, 15 Nov 2019 at 08:32, Robert Bradshaw  wrote:

> On Thu, Nov 14, 2019 at 1:06 AM Kenneth Knowles  wrote:
>
>> Wow. Nice summary, yes. Major calls to action:
>>
>> 0. Never allow a combiner that does not make the format of its state
>> clear in its name/URN. The "update compatibility" problem makes their
>> internal accumulator state essentially part of their public API. Combiners
>> named for what they do are an inherent risk, since we might have a new way
>> to do the same operation with different implementation-detail state.
>>
>
> It seems this will make for a worse user experience, motivated solely by
> limitations in our implementation. I think we can do better. Hypothetical
> idea: what if upgrade required access to the original graph (or at least
> metadata about it) during construction? In this case an ApproximateDistinct
> could look at what was used last time and try to do the same, but be free
> to do something better when unconstrained. Another approach would be to
> encode several alternative expansions in the Beam graph and let the runner
> do the picking (based on prior submission). (Making the CombineFn, as
> opposed to the composite, have several alternatives seems harder to reason
> about, but maybe worth pursuing as well).
>
> This is not unique to Combiners, but any stateful DoFn, or composite
> operations with non-trivial internal structure (and coders). This has been
> discussed a lot, perhaps there are some ideas there we could borrow?
>
> And they will match search terms better, which is a major problem.
>>
>
> I'm not following this--by naming things after their implementation rather
> than their intent I think they will be harder to search for.
>
>
>> 1. Point users to HllCount. This seems to be the best of the three. Does
>> it have a name that is clear enough about the format of its state? Noting
>> that its Java package name includes zetasketch, perhaps.
>> 2. Deprecate the others, at least. And remove them from e.g. Javadoc.
>>
>
> +1
>
>
>> On Wed, Nov 13, 2019 at 10:01 AM Reuven Lax  wrote:
>>
>>>
>>>
>>> On Wed, Nov 13, 2019 at 9:58 AM Ahmet Altay  wrote:
>>>
>>>> Thank you for writing this summary.
>>>>
>>>> On Tue, Nov 12, 2019 at 6:35 PM Reza Rokni  wrote:
>>>>
>>>>> Hi everyone;
>>>>>
>>>>> TL/DR : Discussion on Beam's various Approximate Distinct Count
>>>>> algorithms.
>>>>>
>>>>> Today there are several options for Approximate Algorithms in Apache
>>>>> Beam 2.16 with HLLCount being the most recently added. Would like to 
>>>>> canvas
>>>>> opinions here on the possibility of rationalizing these API's by removing
>>>>> obsolete / less efficient implementations.
>>>>> The current situation:
>>>>>
>>

Re: [PROPOSAL] Add support for writing flattened schemas to pubsub

2019-11-17 Thread Reza Rokni
+1 to reduced boilerplate for basic things folks want to do with SQL.

I like Alex's use of OPTION for more advanced use cases.

On Sun, 17 Nov 2019 at 20:17, Gleb Kanterov  wrote:

> Expanding on what Kenn said regarding having fewer dependencies on SQL.
> Can the whole thing be seen as extending PubSubIO, that would implement
> most of the logic from the proposal, given column annotations, and then
> having a thin layer that connects it with Beam SQL tables?
>
> On Sun, Nov 17, 2019 at 12:38 PM Alex Van Boxel  wrote:
>
>> I like it, but I'm worried about the magic event_timestamp injection.
>> Wouldn't explicit injection via option not be a better approach:
>>
>> CREATE TABLE people (
my_timestamp TIMESTAMP *OPTION(ref="pubsub:event_timestamp")*,
>> my_id VARCHAR *OPTION(ref="pubsub:attributes['id_name']")*,
>> name VARCHAR,
>> age INTEGER
>>   )
>>   TYPE 'pubsub'
>>   LOCATION 'projects/my-project/topics/my-topic'
>>
>>
>>  _/
>> _/ Alex Van Boxel
>>
>>
>> On Sat, Nov 16, 2019 at 7:58 PM Kenneth Knowles  wrote:
>>
>>> Big +1 from me.
>>>
>>> Nice explanation. This makes a lot of sense. Much simpler to understand
>>> with fewer magic strings. It also makes the Beam SQL connector less
>>> dependent on newer SQL features that are simply less widespread. I'm not
>>> too surprised that Calcite's nested row support lags behind the rest of the
>>> library. It simply isn't as widespread and important as flat relational
>>> structures. And MAP is even less widespread.
>>>
>>> Kenn
>>>
>>> On Wed, Nov 13, 2019 at 12:32 PM Brian Hulette 
>>> wrote:
>>>
 I've been looking into adding support for writing (i.e. INSERT INTO
 statements) for the pubsub DDL, which currently only supports reading. This
 DDL requires the defined schema to have exactly three fields:
 event_timestamp, attributes, and payload, corresponding to the fields in
 PubsubMessage (event_timestamp can be configured to come from either
 publish time or from the value in a particular attribute, and the payload
 must be a ROW with a schema corresponding to the JSON written to the pubsub
 topic).

 When writing, I think it's a bit onerous to require users to use
 exactly these three top-level fields. For example imagine we have two
 topics: people, and eligible_voters. people contains a stream of {"name":
 "..", age: XX} items, and we want eligible_voters to contain a stream with
 {"name": ".."} items corresponding to people with age >= 18. With the
 current approach this would look like:

 ```
 CREATE TABLE people (
 event_timestamp TIMESTAMP,
 attributes MAP<VARCHAR, VARCHAR>,
 payload ROW<name VARCHAR, age INTEGER>
   )
   TYPE 'pubsub'
   LOCATION 'projects/my-project/topics/my-topic'

 CREATE TABLE eligible_voters ...

 INSERT INTO eligible_voters (
   SELECT
 ROW(payload.name AS name) AS payload
 FROM people
 WHERE payload.age >= 18
 )
 ```

 This query has lots of renaming and boilerplate; furthermore, ROW(..)
 doesn't seem well supported in Calcite. I had to jump through some hoops
 (like calling my fields $col1) to make something like this work.
 I think it would be great if we could instead handle flattened,
 payload-only schemas. We would still need to have a separate
 event_timestamp field, but everything else would map to a field in the
 payload. With this change the previous example would look like:

 ```
 CREATE TABLE people (
 event_timestamp TIMESTAMP,
 name VARCHAR,
 age INTEGER
   )
   TYPE 'pubsub'
   LOCATION 'projects/my-project/topics/my-topic'

 CREATE TABLE eligible_voters ...

 INSERT INTO eligible_voters (
   SELECT
 name
 FROM people
 WHERE age >= 18
 )
 ```

 This is much cleaner! But the overall approach has an obvious downside:
 with the table definition written like this it's impossible to read from
 or write to the message attributes (unless one is being used for
 event_timestamp). I think we can mitigate this in two ways:
 1. In the future, this flattened schema definition could be represented
 as something like a view on the expanded definition. We could allow users
 to provide some metadata indicating that a column should correspond to a
 particular attribute, rather than a field in the payload. To me this feels
 similar to how you indicate a column should be indexed in a database. It's
 data that's relevant to the storage system, and not to the actual query, so
 it belongs in CREATE TABLE.
 2. In the meantime, we can continue to support the current syntax. If a
 pubsub table definition has *exactly* three fields with the expected types:
 event_timestamp TIMESTAMP, payload ROW<...>, and attributes MAP<VARCHAR,
 VARCHAR>, we can continue to use the current codepath. Otherwise we will
 use the 

Re: [ANNOUNCE] New committer: Brian Hulette

2019-11-15 Thread Reza Rokni
Great news!

On Fri, 15 Nov 2019 at 15:09, Gleb Kanterov  wrote:

> Congratulations!
>
> On Fri, Nov 15, 2019 at 5:44 AM Valentyn Tymofieiev 
> wrote:
>
>> Congratulations, Brian!
>>
>> On Thu, Nov 14, 2019 at 6:25 PM jincheng sun 
>> wrote:
>>
>>> Congratulation Brian!
>>>
>>> Best,
>>> Jincheng
>>>
>>> Kyle Weaver  于2019年11月15日周五 上午7:19写道:
>>>
 Thanks for your contributions and congrats Brian!

 On Thu, Nov 14, 2019 at 3:14 PM Kenneth Knowles 
 wrote:

> Hi all,
>
> Please join me and the rest of the Beam PMC in welcoming a new
> committer: Brian Hulette
>
> Brian introduced himself to dev@ earlier this year and has been
> contributing since then. His contributions to Beam include explorations of
> integration with Arrow, standardizing coders, portability for schemas, and
> presentations at Beam events.
>
> In consideration of Brian's contributions, the Beam PMC trusts him
> with the responsibilities of a Beam committer [1].
>
> Thank you, Brian, for your contributions and looking forward to many
> more!
>
> Kenn, on behalf of the Apache Beam PMC
>
> [1]
> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>




Cleaning up Approximate Algorithms in Beam

2019-11-12 Thread Reza Rokni
Hi everyone;

TL/DR : Discussion on Beam's various Approximate Distinct Count algorithms.

Today there are several options for approximate algorithms in Apache Beam
2.16, with HllCount being the most recently added. I would like to canvass
opinions here on the possibility of rationalizing these APIs by removing
obsolete / less efficient implementations.
The current situation:

There are three options available to users: ApproximateUnique.java,
ApproximateDistinct.java, and HllCount.java.
A quick summary of these APIs as I understand them:

HllCount.java:
Marked as @Experimental

PTransforms to compute HyperLogLogPlusPlus (HLL++) sketches on data streams,
based on the ZetaSketch implementation. For the detailed design of this
class, see https://s.apache.org/hll-in-beam.

ApproximateUnique.java:
Not marked with @Experimental

This API does not expose the ability to create sketches, so it's not
suitable for the OLAP use case that HLL++ is geared towards (doing
pre-aggregation into sketches to allow interactive query speeds). It's also
less precise for the same amount of memory used; the error bound in the
doc comments is:

/* The error is about {@code 2 / sqrt(sampleSize)} */

Compared to the default HLLCount sketch size, its error is 10X larger than
the HLL++ error.
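The two bounds can be compared numerically. This is only a rough side-by-side under stated assumptions: a stored sample and an HLL register do not cost the same number of bytes, so equal counts here do not mean equal memory, and the 1.04/sqrt(m) figure is the standard HLL error formula rather than anything quoted from Beam's docs:

```python
import math

def approximate_unique_error(sample_size):
    # Bound from the ApproximateUnique doc comment: ~ 2 / sqrt(sampleSize)
    return 2 / math.sqrt(sample_size)

def hll_standard_error(precision):
    # Standard HLL++ relative error: ~ 1.04 / sqrt(2^precision)
    return 1.04 / math.sqrt(2 ** precision)

print(f"ApproximateUnique (16384 samples): {approximate_unique_error(16384):.4f}")  # 0.0156
print(f"HLL++ (precision 14, 2^14 registers): {hll_standard_error(14):.4f}")        # 0.0081
```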

ApproximateDistinct.java:
Marked with @Experimental

This is a re-implementation of the HLL++ algorithm, based on the paper
published in 2013. It exposes sketches via a HyperLogLogPlusCoder. We have
not run any benchmarks comparing this implementation to HllCount. We also
need to be careful to ensure that if we change any of these APIs, the
binary format of the sketches never changes: there could be users who have
stored sketches produced with ApproximateDistinct, and it will be important
to ensure they do not have a bad experience.


Proposals:

There are two classes of users expected for these algorithms:

1) Users who simply use the transform to estimate the size of their data
set in Beam

2) Users who want to create sketches and store them, either for
interoperability with other systems, or as features to be used in further
data processing.



For use case 1, it is possible to make use of naming which does not expose
the implementation, however for use case 2 it is important for the
implementation to be explicit as sketches produced with one implementation
will not work with other implementations.
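To make the compatibility point concrete: a sketch is a compact summary whose bytes only make sense to the algorithm that produced it. The toy K-minimum-values estimator below (purely illustrative — not Beam's sketch format, and not HLL) shows why sketches are useful for use case 2: two sketches built by the same implementation merge cleanly into a combined estimate, but their contents would be meaningless to any other estimator:

```python
import hashlib

K = 256  # sketch size; relative error is roughly 1 / sqrt(K)

def h(x):
    # Deterministic hash mapped to a float in [0, 1)
    d = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

def kmv_sketch(items):
    # Keep only the K smallest distinct hash values
    return sorted({h(x) for x in items})[:K]

def merge(s1, s2):
    # Merging two sketches = K smallest of the union
    return sorted(set(s1) | set(s2))[:K]

def estimate(sketch):
    if len(sketch) < K:
        return len(sketch)  # fewer than K distinct values seen: exact
    return int((K - 1) / sketch[K - 1])

a = kmv_sketch(range(0, 60_000))        # one pipeline's sketch
b = kmv_sketch(range(40_000, 100_000))  # another, overlapping the first
print(estimate(merge(a, b)))  # roughly 100,000 distinct values
```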

ApproximateUnique.java:

This one does not provide sketches and based on the notes above, is not as
efficient as HLLCount. However it does have a very searchable name and is
likely to be the one users will gravitate to when searching for Approximate
unique algorithms but do not need the capabilities of sketches.

Ideally we should think about switching the implementation of this
transform to wrap HllCount. However, this could mean changing things in a
way which is not transparent to the end developer, although as a result of
the change they would get a better implementation for free on an upgrade :-)

Another option would be to mark this transform as @Deprecated and create a
new transform, ApproximateCountDistinct, which would wrap HllCount. The
name is also much clearer.

ApproximateDistinct.java:
This transform does generate sketches as output, and given it's marked as
@Experimental, one option we have is to create a name which includes
the algorithm implementation details, for example
ApproximateCountDistinctClearSpring.



HllCount.java:

Again we have a few options here, as the name does not include search words
like 

Re: [Discuss] Beam Summit 2020 Dates & locations

2019-11-10 Thread Reza Rokni
Hi,

I think doing something in Asia later in the year would be very cool. Given
there has not yet been a lot of formal activity there, we should give
plenty of time to ensure good local sponsorship etc...

Cheers

Reza

On Fri, 8 Nov 2019 at 19:10, jincheng sun  wrote:

> +1 for extending the discussion to the user mailing list.
>
> Maximilian Michels  于2019年11月8日周五 下午6:32写道:
>
>> The dates sounds good to me. I agree that the bay area has an advantage
>> because of its large tech community. On the other hand, it is a question
>> of how we run the event. For Berlin we managed to get about 200
>> attendees to Berlin, but for the BeamSummit in Las Vegas with ApacheCon
>> the attendance was much lower.
>>
>> Should this also be discussed on the user mailing list?
>>
>> Cheers,
>> Max
>>
>> On 07.11.19 22:50, Alex Van Boxel wrote:
>> > For date wise, I'm wondering why we should switching the Europe and NA
>> > one, this would mean that the Berlin and the new EU summit would be
>> > almost 1.5 years apart.
>> >
>> >   _/
>> > _/ Alex Van Boxel
>> >
>> >
>> > On Thu, Nov 7, 2019 at 8:43 PM Ahmet Altay > > > wrote:
>> >
>> > I prefer the bay area for the NA summit. My reasoning is that there is
>> > a critical mass of contributors and users in that location, probably
>> > more than alternative NA locations. I was not involved with planning
>> > recently and I do not know if there were people who could not attend
>> > due to location previously. If that is the case, I agree with Elliotte
>> > on looking for other options.
>> >
>> > Related to dates: March (Asia) and mid-May (NA) dates are a bit
>> > close. Mid-June for NA might be better to spread events. Other
>> > pieces looks good.
>> >
>> > Ahmet
>> >
>> > On Thu, Nov 7, 2019 at 7:09 AM Elliotte Rusty Harold
>> > mailto:elh...@ibiblio.org>> wrote:
>> >
>> > The U.S. sadly is not a reliable destination for international
>> > conferences these days. Almost every conference I go to, big and
>> > small, has at least one speaker, sometimes more, who can't get
>> into
>> > the country. Canada seems worth considering. Vancouver,
>> > Montreal, and
>> > Toronto are all convenient.
>> >
>> > On Wed, Nov 6, 2019 at 2:17 PM Griselda Cuevas > > > wrote:
>> >  >
>> >  > Hi Beam Community!
>> >  >
>> >  > I'd like to kick off a thread to discuss potential dates and
>> > venues for the 2020 Beam Summits.
>> >  >
>> >  > I did some research on industry conferences happening in 2020
>> > and pre-selected a few ranges as follows:
>> >  >
>> >  > (2 days) NA between mid-May and mid-June
>> >  > (2 days) EU mid October
>> >  > (1 day) Asia Mini Summit:  March
>> >  >
>> >  > I'd like to hear your thoughts on these dates and get
>> > consensus on exact dates as the convo progresses.
>> >  >
>> >  > For locations these are the options I reviewed:
>> >  >
>> >  > NA: Austin Texas, Berkeley California, Mexico City.
>> >  > Europe: Warsaw, Barcelona, Paris
>> >  > Asia: Singapore
>> >  >
>> >  > Let the discussion begin!
>> >  > G (on behalf of the Beam Summit Steering Committee)
>> >  >
>> >  >
>> >  >
>> >
>> >
>> > --
>> > Elliotte Rusty Harold
>> > elh...@ibiblio.org 
>> >
>>
>



Re: [Discuss] Beam mascot

2019-11-07 Thread Reza Rokni
Salmon... they love streams? :-)

On Fri, 8 Nov 2019 at 12:00, Kenneth Knowles  wrote:

> Agree with Aizhamal that it doesn't matter if they are taken if they are
> not too close in space to Beam: Apache projects, big data, log processing,
> stream processing. Not a legal opinion, but an aesthetic opinion. So I
> would keep Lemur as a possibility. Definitely nginx is far away from Beam
> so it seems OK as long as the art is different.
>
> Also FWIW there are many kinds of Lemurs, and also related Tarsier, of the
> only uncontroversial and non-extinct infraorder within
> suborder Strepsirrhini. I think there's enough room for another mascot with
> big glowing eyes :-). The difference in the designer's art will be more
> significant than the taxonomy.
>
> Kenn
>
> On Tue, Nov 5, 2019 at 4:37 PM Aizhamal Nurmamat kyzy 
> wrote:
>
>> Aww.. that Hoover beaver is cute. But then lemur is also "taken" [1] and
>> the owl too [2].
>>
>> Personally, I don't think it matters much which mascots are taken, as
>> long as the project is not too close in the same space as Beam. Also, it's
>> good to just get all ideas out. We should still consider hedgehogs. I
>> looked up fireflies; they don't look nice, but I am not dismissing the idea
>> :/
>>
>> And thanks for reaching out to designers, Max. To your point:
>> >how do we arrive at a concrete design
>> >once we have consensus on the type of mascot?
>> My thinking is that the designer will come up with few sketches, then we
>> vote on one here in the dev@ list.
>>
>> [1]
>> https://www.nginx.com/blog/introducing-the-lemur-stack-and-an-official-nginx-mascot/
>> [2] https://blog.readme.io/why-every-startup-needs-a-mascot/
>>
>> On Tue, Nov 5, 2019 at 5:31 AM Maximilian Michels  wrote:
>>
>>> Quick update: The mentioned designer has gotten back to me and offered
>>> to sketch something until the end of the week. I've pointed him to this
>>> thread and the existing logo material:
>>> https://beam.apache.org/community/logos/
>>>
>>> [I don't want to interrupt the discussion in any way, I just think
>>> having something concrete will help us to eventually decide what we
>>> want.]
>>>
>>> On 05.11.19 12:49, Maximilian Michels wrote:
>>> > How about fireflies in the Beam light rays? ;)
>>> >
>>> >> Feels like "Beam" would go well with an animal that has glowing
>>> bright
>>> >> eyes such as a lemur
>>> >
>>> > I love the lemur idea because it has almost orange eyes.
>>> >
>>> > Thanks for starting this Aizhamal! I've recently talked to a designer
>>> > which is somewhat famous for creating logos. He was inclined to work
>>> on
>>> > a software project logo. Of course there is a little bit of a price
>>> tag
>>> > attached, though the quote sounded reasonable.
>>> >
>>> > It raises the general question, how do we arrive at a concrete design
>>> > once we have consensus on the type of mascot? I believe there are also
>>> > designers working at companies using Beam ;)
>>> >
>>> > Cheers,
>>> > Max
>>> >
>>> > On 05.11.19 06:14, Eugene Kirpichov wrote:
>>> >> Feels like "Beam" would go well with an animal that has glowing
>>> bright
>>> >> eyes (with beams of light shooting out of them), such as a lemur [1]
>>> >> or an owl.
>>> >>
>>> >> [1] https://www.cnn.com/travel/article/madagascar-lemurs/index.html
>>> >>
>>> >> On Mon, Nov 4, 2019 at 7:33 PM Kenneth Knowles >> >> > wrote:
>>> >>
>>> >> Yes! Let's have a mascot!
>>> >>
>>> >> Direct connections often have duplicates. For example in the log
>>> >> processing space, there is
>>> https://www.linkedin.com/in/hooverbeaver
>>> >>
>>> >> I like a flying squirrel, but Flink already is a squirrel.
>>> >>
>>> >> Hedgehog? I could not find any source of confusion for this one.
>>> >>
>>> >> Kenn
>>> >>
>>> >>
>>> >> On Mon, Nov 4, 2019 at 6:02 PM Robert Burke >> >> > wrote:
>>> >>
>>> >> As both a Canadian, and the resident fan of a programming
>>> >> language with a rodent mascot, I endorse this mascot.
>>> >>
>>> >> On Mon, Nov 4, 2019, 4:11 PM David Cavazos <
>>> dcava...@google.com
>>> >> > wrote:
>>> >>
>>> >> I like it, a beaver could be a cute mascot :)
>>> >>
>>> >> On Mon, Nov 4, 2019 at 3:33 PM Aizhamal Nurmamat kyzy
>>> >> mailto:aizha...@apache.org>> wrote:
>>> >>
>>> >> Hi everybody,
>>> >>
>>> >> I think the idea of creating a Beam mascot has been
>>> >> brought up a couple times here in the past, but I
>>> would
>>> >> like us to go through with it this time if we are all
>>> in
>>> >> agreement:)
>>> >>
>>> >> We can brainstorm in this thread what the mascot
>>> should
>>> >> be given Beam’s characteristics and principles. What
>>> do
>>> >> you all think?
>>> >>
>>> >> For example, I am proposing a 

Re: Apache Pulsar connector for Beam

2019-10-24 Thread Reza Rokni
Hi Taher,

You can see the list of current and work-in-progress IOs here:

https://beam.apache.org/documentation/io/built-in/

Cheers

Reza

On Thu, 24 Oct 2019 at 13:56, Taher Koitawala  wrote:

> Hi All,
>  Been wanting to know if we have a Pulsar connector for Beam.
> Pulsar is another messaging queue like Kafka and I would like to build a
> streaming pipeline with Pulsar. Any help would be appreciated..
>
>
> Regards,
> Taher Koitawala
>




Re: Proposal: Dynamic timer support (BEAM-6857)

2019-10-22 Thread Reza Rokni
+1 on this, having the ability to create timers based on data would make a
bunch of use cases easier to write.

Any thoughts on having an isSet() / read() / setMinimum(timeStamp) type of
ability?
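To make those three operations concrete, here is a toy model of what dynamically named timers might behave like — illustrative Python only, not Beam's API; in this sketch setMinimum only ever moves a timer earlier, and a watermark advance fires due timers:

```python
class TimerMap:
    """Toy per-key map of named timers (not Beam's API)."""

    def __init__(self):
        self._timers = {}  # timer name -> firing timestamp

    def set(self, name, ts):
        self._timers[name] = ts

    def is_set(self, name):
        return name in self._timers

    def read(self, name):
        return self._timers.get(name)

    def set_minimum(self, name, ts):
        # Move the timer earlier if ts precedes the current setting
        cur = self._timers.get(name)
        self._timers[name] = ts if cur is None else min(cur, ts)

    def fire(self, watermark):
        # Fire (and clear) every timer at or before the watermark, in order
        due = sorted((t, n) for n, t in self._timers.items() if t <= watermark)
        for _, n in due:
            del self._timers[n]
        return [(n, t) for t, n in due]

timers = TimerMap()
timers.set("session-end", 100)
timers.set_minimum("session-end", 50)  # moves the timer earlier
timers.set_minimum("session-end", 80)  # no-op: already earlier
print(timers.fire(60))  # [('session-end', 50)]
```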

On Wed, 23 Oct 2019 at 00:52, Reuven Lax  wrote:

> Kenn:
> +1 to using TimerFamily instead of TimerId and TimerMap.
>
> Jan:
> This is definitely not just for DSLs. I've definitely seen cases where the
> user wants different timers based on input data, so they cannot be defined
> statically. As a thought experiment: one stated goal of state + timers was
> to provide the low-level tools we use to implement windowing. However to
> implement windowing you need a dynamic set of timers, not just a single
> one. Now most users don't need to reimplement windowing (though we have had
> some users who had that need, when they wanted something slightly different
> than what native Beam windowing provided), however the need for dynamic
> timers is not unheard of.
>
> +1 to allowing dynamic state. However I think this is separate enough from
> timers that it doesn't need to be coupled in this discussion. Dynamic state
> also raises the wrinkle of pipeline compatibility (as you mentioned),
> which I think is a bit less of an issue for dynamic timers.
>
> Allowing a DSL to specify a DoFnSignature does not quite solve this
> problem. The DSL still needs a way to set and process the timers. It also
> does not solve the problem where the timers are based on input data
> elements, so cannot be known at pipeline construction time. However what
> might be more important is statically defining the timer families, and a
> DSL could do this by specifying a DoFnSignature (and something similar
> could be done with state). Also as mentioned above, this is useful to
> normal Beam users as well, and we shouldn't force normal users to start
> dealing with DoFnSignatures and DoFnInvokers.
>
>
>
>
>
>
> On Tue, Oct 22, 2019 at 7:56 AM Jan Lukavský  wrote:
>
>> Hi Max,
>>
>> wouldn't that be actually the same as
>>
>> class MyDoFn extends DoFn {
>>
>>
>>@ProcessElement
>>public void process(
>>ProcessContext context) {
>>  // "get" would register a new TimerSpec
>>  Timer timer1 = context.getTimer("timer1");
>>  Timer timer2 = context.getTimer("timer2");
>>  timers.set(...);
>>  timers.set(...);
>>}
>>
>> That is - no need to declare anything? One more concern about that - if
>> we allow registration of timers (or even state) dynamically like that it
>> might be harder to perform validation of pipeline upon upgrades.
>>
>> Jan
>>
>> On 10/22/19 4:47 PM, Maximilian Michels wrote:
>> > The idea makes sense to me. I really like that Beam gives upfront
>> > specs for timer and state, but it is not flexible enough for
>> > timer-based libraries or for users which want to dynamically generate
>> > timers.
>> >
>> > I'm not sure about the proposed API yet. Shouldn't we separate the
>> > timer specs from setting actual timers?
>> >
>> > Suggestion:
>> >
>> > class MyDoFn extends DoFn {
>> >   @TimerMap TimerMap timers = TimerSpecs.timerMap();
>> >
>> >   @ProcessElement
>> >   public void process(
>> >   @Element String e,
>> >   @TimerMap TimerMap timers)) {
>> > // "get" would register a new TimerSpec
>> > Timer timer1 = timers.get("timer1");
>> > Timer timer2 = timers.get("timer2");
>> > timers.set(...);
>> > timers.set(...);
>> >   }
>> >
>> >   // No args for "@OnTimer" => use generic TimerMap
>> >   @OnTimer
>> >   public void onTimer(
>> >   @TimerId String timerFired,
>> >   @Timestamp Instant timerTs,
>> >   @TimerMap TimerMap timers) {
>> >  // Timer firing
>> >  ...
>> >  // Set this timer (or another)
>> >  Timer timer = timers.get(timerFired);
>> >  timer.set(...);
>> >   }
>> > }
>> >
>> > What do you think?
>> >
>> > -Max
>> >
>> > On 22.10.19 10:35, Jan Lukavský wrote:
>> >> Hi Kenn,
>> >>
>> >> On 10/22/19 2:48 AM, Kenneth Knowles wrote:
>> >>> This seems extremely useful.
>> >>>
>> >>> I assume you mean `@OnTimer("timers")` in your example. I would
>> >>> suggest that the parameter annotation be something other
>> >>> than @TimerId since that annotation is already used for a very
>> >>> similar but different purpose; they are close enough that it is
>> >>> tempting to pun them, but it is clearer to keep them distinct IMO.
>> >>> Perhaps @TimerName or @TimerKey or some such. Alternatively,
>> >>> keep @TimerId in the parameter list and change the declaration
>> >>> to @TimerFamily("timers"). I think "family" or "group" may be more
>> >>> clear naming than "map".
>> >>>
>> >>> At the portability level, this API does seem to be pretty close to a
>> >>> noop in terms of the messages that needs to be sent over the Fn API,
>> >>> so it makes sense to loosen the protos. By the time the Fn API is in
>> >>> play, all of our desires to catch errors prior to execution are
>> >>> irrelevant anyhow.
>> >>>
>> >>> On the other hand, I think DSLs have 

Re: Introduction + Support in Comms for Beam!

2019-10-01 Thread Reza Rokni
Welcome!

On Tue, 1 Oct 2019 at 11:18, Lukasz Cwik  wrote:

> Welcome to the community.
>
> On Mon, Sep 30, 2019 at 3:15 PM María Cruz  wrote:
>
>> Hi everyone,
>> my name is María Cruz, I am from Buenos Aires but I live in the Bay Area.
>> I recently became acquainted with Apache Beam project, and I got a chance
>> to meet some of the Beam community at Apache Con North America this past
>> September. I'm testing out a communications framework
>> 
>> work on a communications strategy for Beam, to make the most of the content
>> you produce during Beam Summits.
>>
>> A little bit more about me. I am a communications strategist with 11
>> years of experience in the field, 8 of which are in the non-profit sector.
>> I started working in Open Source in 2013, when I joined Wikimedia, the
>> social movement behind Wikipedia. I now work to support Google Open Source
>> projects, and I also volunteer in the communications team of the Apache
>> Software Foundation, working closely with Sally (for those of you who know
>> her).
>>
>> I will be sending the list a proposal in the coming days. Looking forward
>> to hearing from you!
>>
>> Best,
>>
>> María
>>
>



Re: [VOTE] Sign a pledge to discontinue support of Python 2 in 2020.

2019-10-01 Thread Reza Rokni
+1

On Tue, 1 Oct 2019 at 13:54, Tanay Tummalapalli  wrote:

> +1
>
> On Tue, Oct 1, 2019 at 8:19 AM Suneel Marthi  wrote:
>
>> +1
>>
>> On Mon, Sep 30, 2019 at 10:33 PM Manu Zhang 
>> wrote:
>>
>>> +1
>>>
>>> On Tue, Oct 1, 2019 at 9:44 AM Austin Bennett <
>>> whatwouldausti...@gmail.com> wrote:
>>>
 +1

 On Mon, Sep 30, 2019 at 5:22 PM Valentyn Tymofieiev <
 valen...@google.com> wrote:

> Hi everyone,
>
> Please vote whether to sign a pledge on behalf of Apache Beam to
> sunset Beam Python 2 offering (in new releases) in 2020 on
> http://python3statement.org as follows:
>
> [ ] +1: Sign a pledge to discontinue support of Python 2 in Beam in
> 2020.
> [ ] -1: Do not sign a pledge to discontinue support of Python 2 in
> Beam in 2020.
>
> The motivation and details for this vote were discussed in [1, 2].
> Please follow up in [2] if you have any questions.
>
> This is a procedural vote [3] that will follow the majority approval
> rules and will be open for at least 72 hours.
>
> Thanks,
> Valentyn
>
> [1]
> https://lists.apache.org/thread.html/eba6caa58ea79a7ecbc8560d1c680a366b44c531d96ce5c699d41535@%3Cdev.beam.apache.org%3E
> [2]
> https://lists.apache.org/thread.html/456631fe1a696c537ef8ebfee42cd3ea8121bf7c639c52da5f7032e7@%3Cdev.beam.apache.org%3E
> [3] https://www.apache.org/foundation/voting.html
>
>



Re: [ANNOUNCE] New committer: Valentyn Tymofieiev

2019-08-26 Thread Reza Rokni
Thanks Valentyn!

On Tue, 27 Aug 2019, 05:32 Pablo Estrada,  wrote:

> Thanks Valentyn!
>
> On Mon, Aug 26, 2019 at 2:29 PM Robin Qiu  wrote:
>
>> Thank you Valentyn! Congratulations!
>>
>> On Mon, Aug 26, 2019 at 2:28 PM Robert Bradshaw 
>> wrote:
>>
>>> Hi,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming a new
>>> committer: Valentyn Tymofieiev
>>>
>>> Valentyn has made numerous contributions to Beam over the last several
>>> years (including 100+ pull requests), most recently pushing through
>>> the effort to make Beam compatible with Python 3. He is also an active
>>> participant in design discussions on the list, participates in release
>>> candidate validation, and proactively helps keep our tests green.
>>>
>>> In consideration of Valentyn's contributions, the Beam PMC trusts him
>>> with the responsibilities of a Beam committer [1].
>>>
>>> Thank you, Valentyn, for your contributions and looking forward to many
>>> more!
>>>
>>> Robert, on behalf of the Apache Beam PMC
>>>
>>> [1]
>>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>>
>>


Re: SqlTransform Metadata

2019-08-21 Thread Reza Rokni
@Kenn / @Rob, have there been any other discussions on how the timestamp
value can be accessed from within the SQL since this thread in May?

If not, my vote is for a convenience method that gives access to the
timestamp as a function call within the SQL statement.

Reza

On Wed, 22 May 2019 at 10:06, Reza Rokni  wrote:

> Hi,
>
> Coming back to this do we have enough of a consensus to say that in
> principle this is a good idea? If yes I will raise a Jira for this.
>
> Cheers
>
> Reza
>
> On Thu, 16 May 2019 at 02:58, Robert Bradshaw  wrote:
>
>> On Wed, May 15, 2019 at 8:51 PM Kenneth Knowles  wrote:
>> >
>> > On Wed, May 15, 2019 at 3:05 AM Robert Bradshaw 
>> wrote:
>> >>
>> >> Isn't there an API for concisely computing new fields from old ones?
>> >> Perhaps these expressions could contain references to metadata value
>> >> such as timestamp. Otherwise,
>> >
>> > Even so, being able to refer to the timestamp implies something about
>> its presence in a namespace, shared with other user-decided names.
>>
>> I was thinking that functions may live in a different namespace than
>> fields.
>>
>> > And it may be nice for users to use that API within the composite
>> SqlTransform. I think there are a lot of options.
>> >
>> >> Rather than withMetadata reifying the value as a nested field, with
>> >> the timestamp, window, etc. at the top level, one could let it take a
>> >> field name argument that attaches all the metadata as an extra
>> >> (struct-like) field. This would be like attachX, but without having to
>> >> have a separate method for every X.
>> >
>> > If you leave the input field names at the top level, then any "attach"
>> style API requires choosing a name that doesn't conflict with input field
>> names. You can't write a generic transform that works with all inputs. I
>> think it is much simpler to move the input field all into a nested
>> row/struct. Putting all the metadata in a second nested row/struct is just
>> as good as top-level, perhaps. But moving the input into the struct/row is
>> important.
>>
>> Very good point about writing generic transforms. It does mean a lot
>> of editing if one decides one wants to access the metadata field(s)
>> after-the-fact. (I also don't think we need to put the metadata in a
>> nested struct if the value is.)
>>
>> >> It seems restrictive to only consider this a a special mode for
>> >> SqlTransform rather than a more generic operation. (For SQL, my first
>> >> instinct would be to just make this a special function like
>> >> element_timestamp(), but there is some ambiguity there when there are
>> >> multiple tables in the expression.)
>> >
>> > I would propose it as both: we already have some Reify transforms, and
>> you could make a general operation that does this small data preparation
>> easily. I think the proposal is just to add a convenience build method on
>> SqlTransform to include the underlying functionality as part of the
>> composite, which we really already have.
>> >
>> > I don't think we should extend SQL with built-in functions for
>> element_timestamp() and things like that, because SQL already has TIMESTAMP
>> columns and it is very natural to use SQL on unbounded relations where the
>> timestamp is just part of the data.
>>
>> That's why I was suggesting a single element_metadata() rather than
>> exploding each one out.
>>
>> Do you have a pointer to what the TIMESTAMP columns are? (I'm assuming
>> this is a special field, but distinct from the metadata timestamp?)
>>
>> >> On Wed, May 15, 2019 at 5:03 AM Reza Rokni  wrote:
>> >> >
>> >> > Hi,
>> >> >
>> >> > One use case would be when dealing with the windowing functions for
>> example:
>> >> >
>> >> > SELECT f_int, COUNT(*) , TUMBLE_START(f_timestamp, INTERVAL '1'
>> HOUR) tumble_start
>> >> >   FROM PCOLLECTION
>> >> >   GROUP BY
>> >> > f_int,
>> >> > TUMBLE(f_timestamp, INTERVAL '1' HOUR)
>> >> >
>> >> > For an element which uses metadata to inform the EventTime of the
>> element, rather than data within the element itself, I would need to create
>> a new schema which adds the timestamp as a field. I think other examples
>> which may be interesting are getting the value of a row with the max/min
>> timestamp. None o
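To illustrate the TUMBLE grouping from the query above outside of SQL: the window a row falls into is simply its event timestamp floored to the window size. A minimal, hypothetical Python sketch of that assignment (my own names, not Beam or Calcite internals):

```python
from collections import defaultdict

def tumble_start(event_ts, size):
    """Return the start of the fixed (tumbling) window containing event_ts."""
    return event_ts - (event_ts % size)

def group_by_tumble(rows, size):
    """Group (timestamp, value) rows by their tumbling-window start,
    mirroring GROUP BY TUMBLE(f_timestamp, INTERVAL ...)."""
    windows = defaultdict(list)
    for ts, value in rows:
        windows[tumble_start(ts, size)].append(value)
    return dict(windows)

# Rows with timestamps in seconds, windowed into 3600s (1 hour) tumbles.
rows = [(10, "a"), (3599, "b"), (3600, "c"), (7250, "d")]
grouped = group_by_tumble(rows, 3600)
```

The question in the thread is exactly where `event_ts` comes from when it lives in element metadata rather than in a row field.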

Re: Dataflow worker overview graphs

2019-08-20 Thread Reza Rokni
Very nice resource, thanx Mikhail.



On Fri, 9 Aug 2019 at 05:22, Mikhail Gryzykhin  wrote:

> Unfortunately no, I don't have those for streaming explicitly.
>
> However most of code is shared between streaming and batch with main
> difference in initialization. Same goes for boilerplate parts of legacy vs
> FnApi.
>
> If you happen to create anything similar for streaming, please update page
> and let me know. Also I'll update this page with relevant changes once I
> get back to worker.
>
> --Mikhail
>
> On Thu, Aug 8, 2019 at 2:13 PM Ankur Goenka  wrote:
>
>> Thanks Mikhail. This is really useful.
>> Do you also have something similar for Streaming use case. More
>> specifically for Portable (fn_api) based streaming pipelines.
>>
>>
>> On Thu, Aug 8, 2019 at 2:08 PM Mikhail Gryzykhin 
>> wrote:
>>
>>> Hello everybody,
>>>
>>> Just wanted to share that I have found some graphs for dataflow worker I
>>> created while starting working on it. They cover specific scenarios, but
>>> may be useful for newcomers, so I put them into this wiki page.
>>>
>>> If you feel they belong to some other location, please let me know.
>>>
>>> Regards,
>>> Mikhail.
>>>
>>



Re: [VOTE] Support ZetaSQL as another SQL dialect for BeamSQL in Beam repo

2019-08-12 Thread Reza Rokni
+1

On Tue, 13 Aug 2019 at 09:28, Ahmet Altay  wrote:

> +1
>
> On Mon, Aug 12, 2019 at 6:27 PM Kenneth Knowles  wrote:
>
>> +1
>>
>> On Mon, Aug 12, 2019 at 4:43 PM Rui Wang  wrote:
>>
>>> Hi Community,
>>>
>>> I am using this separate thread to collect votes on contributing Beam
>>> ZetaSQL(my way to say ZetaSQL as a dialect supported by BeamSQL) to Beam
>>> repo.
>>>
>>> There are discussions related to benefits, technical design and others
>>> on Beam ZetaSQL in [1]. The Beam ZetaSQL code lives in [2]. Note that this
>>> vote is not about merging the PR, which should be decided by code review.
>>> This vote is only to vote if Beam ZetaSQL should live in Beam repo.
>>>
>>> +1: Beam repo can host Beam ZetaSQL
>>> -1: Beam repo should not host Beam ZetaSQL
>>>
>>> If there are more questions related to Beam ZetaSQL, please discuss it
>>> in [1].
>>>
>>> [1]:
>>> https://lists.apache.org/thread.html/eab70bb99218aaedfd506e979967379c0efa05ea56a792a1486f9c74@%3Cdev.beam.apache.org%3E
>>> [2]: https://github.com/apache/beam/pull/9210
>>>
>>> -Rui
>>>
>>



Re: [ANNOUNCE] New committer: Kyle Weaver

2019-08-06 Thread Reza Rokni
Congratz!

On Wed, 7 Aug 2019 at 06:40, Chamikara Jayalath 
wrote:

> Congrats!!
>
> On Tue, Aug 6, 2019 at 3:33 PM Udi Meiri  wrote:
>
>> Congrats Kyle!
>>
>> On Tue, Aug 6, 2019 at 2:00 PM Melissa Pashniak 
>> wrote:
>>
>>> Congratulations Kyle!
>>>
>>> On Tue, Aug 6, 2019 at 1:36 PM Yichi Zhang  wrote:
>>>
 Congrats Kyle!

 On Tue, Aug 6, 2019 at 1:29 PM Aizhamal Nurmamat kyzy <
 aizha...@google.com> wrote:

> Thank you, Kyle! And congratulations :)
>
> On Tue, Aug 6, 2019 at 10:09 AM Hannah Jiang 
> wrote:
>
>> Congrats Kyle!
>>
>> On Tue, Aug 6, 2019 at 9:52 AM David Morávek 
>> wrote:
>>
>>> Congratulations Kyle!!
>>>
>>> Sent from my iPhone
>>>
>>> On 6 Aug 2019, at 18:47, Anton Kedin  wrote:
>>>
>>> Congrats!
>>>
>>> On Tue, Aug 6, 2019, 9:37 AM Ankur Goenka  wrote:
>>>
 Congratulations Kyle!

 On Tue, Aug 6, 2019 at 9:35 AM Ahmet Altay 
 wrote:

> Hi,
>
> Please join me and the rest of the Beam PMC in welcoming a new
> committer: Kyle Weaver.
>
> Kyle has been contributing to Beam for a while now. And in that
> time period Kyle got the portable spark runner feature complete for 
> batch
> processing. [1]
>
> In consideration of Kyle's contributions, the Beam PMC trusts him
> with the responsibilities of a Beam committer [2].
>
> Thank you, Kyle, for your contributions and looking forward to
> many more!
>
> Ahmet, on behalf of the Apache Beam PMC
>
> [1]
> https://lists.apache.org/thread.html/c43678fc24c9a1dc9f48c51c51950aedcb9bc0fd3b633df16c3d595a@%3Cuser.beam.apache.org%3E
> [2] https://beam.apache.org/contribute/become-a-committer
> /#an-apache-beam-committer
>




Re: [ANNOUNCE] New committer: Jan Lukavský

2019-08-01 Thread Reza Rokni
Congratulations, awesome stuff!

On Thu, 1 Aug 2019, 12:11 Maximilian Michels,  wrote:

> Congrats, Jan! Good to see you become a committer :)
>
> On 01.08.19 12:37, Łukasz Gajowy wrote:
> > Congratulations!
> >
> > On Thu, 1 Aug 2019 at 11:16, Robert Bradshaw  > > wrote:
> >
> > Congratulations!
> >
> > On Thu, Aug 1, 2019 at 9:59 AM Jan Lukavský  > > wrote:
> >
> > Thanks everyone!
> >
> > Looking forward to working with this great community! :-)
> >
> > Cheers,
> >
> >  Jan
> >
> > On 8/1/19 12:18 AM, Rui Wang wrote:
> > > Congratulations!
> > >
> > >
> > > -Rui
> > >
> > > On Wed, Jul 31, 2019 at 10:51 AM Robin Qiu  > > > wrote:
> > >
> > > Congrats!
> > >
> > > On Wed, Jul 31, 2019 at 10:31 AM Aizhamal Nurmamat kyzy
> > > mailto:aizha...@apache.org>> wrote:
> > >
> > > Congratulations, Jan! Thank you for your contributions!
> > >
> > > On Wed, Jul 31, 2019 at 10:04 AM Tanay Tummalapalli
> > > mailto:ttanay...@gmail.com>>
> wrote:
> > >
> > > Congratulations!
> > >
> > > On Wed, Jul 31, 2019 at 10:05 PM Ahmet Altay
> > > mailto:al...@google.com>>
> wrote:
> > >
> > > Congratulations Jan! Thank you for your
> > > contributions!
> > >
> > > On Wed, Jul 31, 2019 at 2:30 AM Ankur Goenka
> > > mailto:goe...@google.com>>
> > > wrote:
> > >
> > > Congratulations Jan!
> > >
> > > On Wed, Jul 31, 2019, 1:23 AM David
> > > Morávek  > > > wrote:
> > >
> > > Congratulations Jan, well deserved! ;)
> > >
> > > D.
> > >
> > > On Wed, Jul 31, 2019 at 10:17 AM Ryan
> > > Skraba  > > > wrote:
> > >
> > > Congratulations Jan!
> > >
> > > On Wed, Jul 31, 2019 at 10:10 AM
> > > Ismaël Mejía  > > > wrote:
> > > >
> > > > Hi,
> > > >
> > > > Please join me and the rest of
> > > the Beam PMC in welcoming a new
> > > > committer: Jan Lukavský.
> > > >
> > > > Jan has been contributing to
> > > Beam for a while, he was part of
> > > the team
> > > > that contributed the Euphoria
> > > DSL extension, and he has done
> > > > interesting improvements for the
> > > Spark and Direct runner. He has
> also
> > > > been active in the community
> > > discussions around the Beam model
> and
> > > > other subjects.
> > > >
> > > > In consideration of Jan's
> > > contributions, the Beam PMC trusts
> > > him with
> > > > the responsibilities of a Beam
> > > committer [1].
> > > >
> > > > Thank you, Jan, for your
> > > contributions and looking forward
> > > to many more!
> > > >
> > > > Ismaël, on behalf of the Apache
> > > Beam PMC
> > > >
> > > > [1]
> > >
> https://beam.apache.org/committer/committer
> > >
>
>


Re: Slowly changing lookup cache as a Table in BeamSql

2019-07-17 Thread Reza Rokni
*Can we use [slowly changing lookup cache] approach if the source is [HDFS
(or) HIVE] (data is changing), where the PCollection cannot fit into Memory
in BeamSQL?*

That can depend on the runner; in streaming mode, for the Dataflow runner the
side input needs to fit into memory.
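For readers wondering what the refresh logic behind the slowly-changing lookup cache pattern looks like in the abstract, here is a minimal, hypothetical Python sketch: reload the lookup table from the (slow) source only when the cached copy is older than a refresh interval. The stubbed `load_fn` stands in for a real BigQuery/Hive read; none of these names come from the Beam API.

```python
import time

class SlowChangingCache:
    """Sketch of a slowly-changing lookup cache: the lookup table is reloaded
    from the (slow) source only when the cached copy is older than
    max_age_seconds; all other lookups are served from memory."""

    def __init__(self, load_fn, max_age_seconds, clock=time.monotonic):
        self._load_fn = load_fn          # e.g. a BigQuery/Hive read, stubbed here
        self._max_age = max_age_seconds
        self._clock = clock
        self._data = None
        self._loaded_at = None

    def lookup(self, key, default=None):
        now = self._clock()
        if self._data is None or now - self._loaded_at >= self._max_age:
            self._data = self._load_fn()   # refresh the whole table
            self._loaded_at = now
        return self._data.get(key, default)

# Fake source that changes between loads; a manual clock stands in for time.
loads = [{"k": 1}, {"k": 2}]
fake_now = [0.0]
cache = SlowChangingCache(lambda: loads.pop(0), max_age_seconds=60,
                          clock=lambda: fake_now[0])
first = cache.lookup("k")       # triggers the initial load
fake_now[0] = 30.0
second = cache.lookup("k")      # still fresh, served from memory
fake_now[0] = 61.0
third = cache.lookup("k")       # stale, reloads and sees the new value
```

In the Beam pattern the periodic reload is driven by GenerateSequence feeding a `View.asSingleton()` side input rather than a clock check, but the whole-table-in-memory constraint is the same one noted above.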

On Wed, 17 Jul 2019 at 15:31, rahul patwari 
wrote:

> Hi,
>
> Please add me as a contributor to the Beam Issue Tracker. I would like to
> work on this feature. My ASF Jira Username: "rahul8383"
>
> Thanks,
> Rahul
>
>
>
> On Wed, Jul 17, 2019 at 1:06 AM Rui Wang  wrote:
>
>> Another approach is to let BeamSQL support it natively, as the title of
>> this thread says: "as a Table in BeamSQL".
>>
>> We might be able to define a table with properties that says this table
>> return a PCollectionView. By doing so we will have a trigger based
>> PCollectionView available in SQL rel nodes, thus SQL will be able to
>> implement [*Pattern: Slowly-changing lookup cache].* By this way, users
>> only need to construct a table and set it to SqlTransform
>> <https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/SqlTransform.java#L186>
>> *. *
>>
>> Create a JIRA to track this idea:
>> https://jira.apache.org/jira/browse/BEAM-7758
>>
>>
>> -Rui
>>
>>
>> On Tue, Jul 16, 2019 at 7:12 AM Reza Rokni  wrote:
>>
>>> Hi Rahul,
>>>
>>> FYI, that pattern is also available in the Beam docs (with an updated
>>> code example):
>>> https://beam.apache.org/documentation/patterns/side-input-patterns/.
>>>
>>> Please note in the DoFn that feeds the View.asSingleton() you will need
>>> to manually call BigQuery using the BigQuery client.
>>>
>>> Regards
>>>
>>> Reza
>>>
>>> On Tue, 16 Jul 2019 at 14:37, rahul patwari 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> we are following [*Pattern: Slowly-changing lookup cache*] from
>>>> https://cloud.google.com/blog/products/gcp/guide-to-common-cloud-dataflow-use-case-patterns-part-1
>>>>
>>>> We have a use case to read slowly changing bounded data as a
>>>> PCollection along with the main PCollection from Kafka(windowed) and use it
>>>> in the query of BeamSql.
>>>>
>>>> Is it possible to design such a use case with Beam Java SDK?
>>>>
>>>> Approaches followed but not Successful:
>>>>
>>>> 1) GenerateSequence => GlobalWindow with Data Trigger => Composite
>>>> Transform(which applies Beam I/O on the
>>>> pipeline[PCollection.getPipeline()]) => Convert the resulting PCollection
>>>> to PCollection Apply BeamSQL
>>>> Comments: Beam I/O reads data only once even though a long value is
>>>> generated from GenerateSequece with periodicity. The expectation is that
>>>> whenever a long value is generated, Beam I/O will be used to read the
>>>> latest data. Is this because of optimizations in the DAG? Can the
>>>> optimizations be overridden?
>>>>
>>>> 2) The pipeline is the same as approach 1. But, instead of using a
>>>> composite transform, a DoFn is used where a for loop will emit each Row of
>>>> the PCollection.
>>>> comments: The output PCollection is unbounded. But, we need a bounded
>>>> PCollection as this PCollection is used to JOIN with PCollection of each
>>>> window from Kafka. How can we convert an Unbounded PCollection to Bounded
>>>> PCollection inside a DoFn?
>>>>
>>>> Are there any better Approaches?
>>>>
>>>> Regards,
>>>> Rahul
>>>>
>>>>
>>>>
>>>
>>



Re: Slowly changing lookup cache as a Table in BeamSql

2019-07-16 Thread Reza Rokni
+1

On Tue, 16 Jul 2019 at 20:36, Rui Wang  wrote:

> Another approach is to let BeamSQL support it natively, as the title of
> this thread says: "as a Table in BeamSQL".
>
> We might be able to define a table with properties that says this table
> return a PCollectionView. By doing so we will have a trigger based
> PCollectionView available in SQL rel nodes, thus SQL will be able to
> implement [*Pattern: Slowly-changing lookup cache].* By this way, users
> only need to construct a table and set it to SqlTransform
> <https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/SqlTransform.java#L186>
> *. *
>
> Create a JIRA to track this idea:
> https://jira.apache.org/jira/browse/BEAM-7758
>
>
> -Rui
>
>
> On Tue, Jul 16, 2019 at 7:12 AM Reza Rokni  wrote:
>
>> Hi Rahul,
>>
>> FYI, that pattern is also available in the Beam docs (with an updated
>> code example):
>> https://beam.apache.org/documentation/patterns/side-input-patterns/.
>>
>> Please note in the DoFn that feeds the View.asSingleton() you will need
>> to manually call BigQuery using the BigQuery client.
>>
>> Regards
>>
>> Reza
>>
>> On Tue, 16 Jul 2019 at 14:37, rahul patwari 
>> wrote:
>>
>>> Hi,
>>>
>>> we are following [*Pattern: Slowly-changing lookup cache*] from
>>> https://cloud.google.com/blog/products/gcp/guide-to-common-cloud-dataflow-use-case-patterns-part-1
>>>
>>> We have a use case to read slowly changing bounded data as a PCollection
>>> along with the main PCollection from Kafka(windowed) and use it in the
>>> query of BeamSql.
>>>
>>> Is it possible to design such a use case with Beam Java SDK?
>>>
>>> Approaches followed but not Successful:
>>>
>>> 1) GenerateSequence => GlobalWindow with Data Trigger => Composite
>>> Transform(which applies Beam I/O on the
>>> pipeline[PCollection.getPipeline()]) => Convert the resulting PCollection
>>> to PCollection Apply BeamSQL
>>> Comments: Beam I/O reads data only once even though a long value is
>>> generated from GenerateSequece with periodicity. The expectation is that
>>> whenever a long value is generated, Beam I/O will be used to read the
>>> latest data. Is this because of optimizations in the DAG? Can the
>>> optimizations be overridden?
>>>
>>> 2) The pipeline is the same as approach 1. But, instead of using a
>>> composite transform, a DoFn is used where a for loop will emit each Row of
>>> the PCollection.
>>> comments: The output PCollection is unbounded. But, we need a bounded
>>> PCollection as this PCollection is used to JOIN with PCollection of each
>>> window from Kafka. How can we convert an Unbounded PCollection to Bounded
>>> PCollection inside a DoFn?
>>>
>>> Are there any better Approaches?
>>>
>>> Regards,
>>> Rahul
>>>
>>>
>>>
>>
>



Re: [ANNOUNCE] New committer: Robert Burke

2019-07-16 Thread Reza Rokni
Congratulations !

On Tue, 16 Jul 2019, 18:24 Ahmet Altay,  wrote:

> Hi,
>
> Please join me and the rest of the Beam PMC in welcoming a new committer: 
> Robert
> Burke.
>
> Robert has been contributing to Beam and actively involved in the
> community for over a year. He has been actively working on Go SDK, helping
> users, and making it easier for others to contribute [1].
>
> In consideration of Robert's contributions, the Beam PMC trusts him with
> the responsibilities of a Beam committer [2].
>
> Thank you, Robert, for your contributions and looking forward to many more!
>
> Ahmet, on behalf of the Apache Beam PMC
>
> [1]
> https://lists.apache.org/thread.html/8f729da2d3009059d7a8b2d8624446be161700dcfa953939dd3530c6@%3Cdev.beam.apache.org%3E
> [2] https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-
> committer
>


Re: Slowly changing lookup cache as a Table in BeamSql

2019-07-16 Thread Reza Rokni
Hi Rahul,

FYI, that pattern is also available in the Beam docs (with an updated code
example):
https://beam.apache.org/documentation/patterns/side-input-patterns/.

Please note in the DoFn that feeds the View.asSingleton() you will need to
manually call BigQuery using the BigQuery client.

Regards

Reza

On Tue, 16 Jul 2019 at 14:37, rahul patwari 
wrote:

> Hi,
>
> we are following [*Pattern: Slowly-changing lookup cache*] from
> https://cloud.google.com/blog/products/gcp/guide-to-common-cloud-dataflow-use-case-patterns-part-1
>
> We have a use case to read slowly changing bounded data as a PCollection
> along with the main PCollection from Kafka(windowed) and use it in the
> query of BeamSql.
>
> Is it possible to design such a use case with Beam Java SDK?
>
> Approaches followed but not Successful:
>
> 1) GenerateSequence => GlobalWindow with Data Trigger => Composite
> Transform(which applies Beam I/O on the
> pipeline[PCollection.getPipeline()]) => Convert the resulting PCollection
> to PCollection Apply BeamSQL
> Comments: Beam I/O reads data only once even though a long value is
> generated from GenerateSequece with periodicity. The expectation is that
> whenever a long value is generated, Beam I/O will be used to read the
> latest data. Is this because of optimizations in the DAG? Can the
> optimizations be overridden?
>
> 2) The pipeline is the same as approach 1. But, instead of using a
> composite transform, a DoFn is used where a for loop will emit each Row of
> the PCollection.
> comments: The output PCollection is unbounded. But, we need a bounded
> PCollection as this PCollection is used to JOIN with PCollection of each
> window from Kafka. How can we convert an Unbounded PCollection to Bounded
> PCollection inside a DoFn?
>
> Are there any better Approaches?
>
> Regards,
> Rahul
>
>
>



Re: Looping timer blog

2019-06-26 Thread Reza Rokni
On Tue, 25 Jun 2019 at 21:20, Jan Lukavský  wrote:

>
> On 6/25/19 1:43 PM, Reza Rokni wrote:
>
>
>
> On Tue, 25 Jun 2019 at 18:12, Jan Lukavský  wrote:
>
>> > The TTL check would be in the same Timer rather than a separate Timer.
>> The max value processed in each OnTimer call would be stored in valuestate
>> and used as base to know how long it has been seen the pipeline has seen an
>> external value for that key.
>>
>> OK, that seems to work, if you store maximum timestamp in a value state
>> (that is, basically you generate per-key watermark).
>>
>> > You can store it in ValueState rather than BagState, but yes you store
>> that value in State ready for the next OnTimer() fire.
>>
>> In my understanding of the problem, I'd say that this approach seems a
>> little suboptimal. Consider the following, when trying to generate the OHLC
>> data (open, high, low, close, that is move last closing price to next
>> window opening price)
>>
>>  - suppose we have three times T1 < T2 < T3 < T4, where times T2 and T4
>> denote "end of windows" (closing times)
>>
>>  - first (in processing time), we receive value for time T3, we cache it
>> in ValueState, we set timer for T4
>>
>>  - next, we receive value for T1 - but we cannot overwrite the value
>> already written for T3, right? What to do then? And will we reset timer to
>> time T3?
>>
>>  - basically, because we received *two* values, both of which are needed
>> and no timer could have been fired in between, we need both values stored
>> to know which value to emit when timer fires. And consider that on batch,
>> the timer fires only after we see all data (without any ordering).
>>
> I assume you are referring to late data rather than out-of-order data
> (the latter being taken care of by the in-memory sort). As discussed in the
> talk late data is a sharp edge, one solution for late data is to branch it
> away before GlobalWindow + State DoFn. This can then be output from the
> pipeline into a sink with a marker to initiate a manual procedure for
> correction. Essentially a manual redaction.
>
> Which in-memory sort do you refer to? I'm pretty sure there must be
> sorting involved for this to work, but I'm not quite sure where exactly it
> is in your implementation. You said that you can put data in ValueState
> rather than BagState, so do you have a List as a value in the ValueState?
> Or do you wrap the stateful par do with some other sorting logic? And if
> so, how does this work on batch? I suppose that it has to all fit to
> memory. I think this all goes around the @RequiresTimeSortedInput
> annotation that I propose. Maybe we can cooperate on that? :)
>
Huh... nice, this chat made me notice a bug in the looping timer example
code we missed, thanks :-). The ValueState timerRunning should actually
be a ValueState minTimestamp, and the check to set the timer needs to be:
if (NULL or element.Timestamp < timer.Timestamp). This also requires the
timer-read pattern, as we don't have timer.read():
https://stackoverflow.com/questions/55912522/setting-a-timer-to-the-minimum-timestamp-seen/55912542#55912542.

For the hold and propagate pattern (for those following the original thread
the pattern is not covered in the blog example code, but discussed at the
summit):
OnProcess()
- You can drop the accumulators into BagState.
- A timer is set at the minimum time interval.
OnTimer()
- The list is sorted in memory. For a lot of timeseries use cases (for
example OHLC) the memory issues are heavily mitigated, as we can use
FixedWindows partial aggregations before the GlobalWindow stage. (Partial
because they don't have the correct Open value set until they flow into the
Global Window.) Of course, how big the window is dictates the compression
you would get.
- The current timer is set again to fire in the next interval window.
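As a language-neutral illustration of the OnTimer step, here is a hypothetical Python sketch of sorting buffered OHLC partials and propagating each window's close into the next window's open (names and structure are my own, not the Beam API or the blog's code):

```python
def propagate_opens(partials):
    """Sort buffered per-window OHLC partials by window start and set each
    window's open to the previous window's close (the first open stays
    as-is). Each partial is a dict: start, open, high, low, close."""
    ordered = sorted(partials, key=lambda p: p["start"])
    for prev, cur in zip(ordered, ordered[1:]):
        cur["open"] = prev["close"]
        # The carried-over open may extend the window's observed range.
        cur["high"] = max(cur["high"], cur["open"])
        cur["low"] = min(cur["low"], cur["open"])
    return ordered

# Two partial aggregates buffered (out of order) in the BagState stand-in.
partials = [
    {"start": 60, "open": 11, "high": 13, "low": 10, "close": 12},
    {"start": 0,  "open": 5,  "high": 9,  "low": 4,  "close": 8},
]
filled = propagate_opens(partials)
```

The second window's open becomes the first window's close (8), which also lowers its low, matching the "correct Open only after flowing into the GlobalWindow" note above.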

@RequiresTimeSortedInput looks super interesting, happy to help out.
Although it's a harder problem than the targeted timeseries use cases, where
FixedWindows aggregations can be used before the final step.

>
> Or? Am I missing something?
>>
>> Jan
>> On 6/25/19 6:00 AM, Reza Rokni wrote:
>>
>>
>>
>> On Fri, 21 Jun 2019 at 18:02, Jan Lukavský  wrote:
>>
>>> Hi Reza,
>>>
>>> great prezentation on the Beam Summit. I have had a few posts here in
>>> the list during last few weeks, some of which might actually be related to
>>> both looping timers and validity windows. But maybe you will be able to see
>>> a different approach, than I do, so questions:
>>>
>>>  a) because of [1] timers 

Re: Return types of Write transforms (aka best way to signal)

2019-06-26 Thread Reza Rokni
The use case of a transform waiting for a sink or sinks to complete is very
interesting indeed!

Curious, if a sink internally makes use of a Global Window with processing
time triggers to push its writes, what mechanism could be used to release a
transform waiting for a signal from the Sink(s) that all processing is done
and it can move forward?

On Thu, 27 Jun 2019 at 03:58, Robert Bradshaw  wrote:

> Regarding Python, yes and no. Python doesn't distinguish at compile
> time between (1), (2), and (6), but that doesn't mean it isn't part of
> the public API and people might start counting on it, so it's in some
> sense worse. We can also do (3) (which is less cumbersome in Python,
> either returning a tuple or a dict) or (4).
>
> Good point about providing a simple solution (something that can be
> waited on at least) and allowing for with* modifiers to return more.
>
> On Wed, Jun 26, 2019 at 7:08 PM Chamikara Jayalath 
> wrote:
> >
> > BTW regarding Python SDK, I think the answer to this question is simpler
> for Python SDK due to the lack of types. Most examples I know just return a
> PCollection from the Write transform which may or may not be ignored by
> users. If the PCollection is used, the user should be aware of the element
> type of the returned PCollection and should use it accordingly in
> subsequent transforms.
> >
> > Thanks,
> > Cham
> >
> > On Wed, Jun 26, 2019 at 9:57 AM Chamikara Jayalath 
> wrote:
> >>
> >>
> >>
> >> On Wed, Jun 26, 2019 at 5:46 AM Robert Bradshaw 
> wrote:
> >>>
> >>> Good question.
> >>>
> >>> I'm not sure what could be done with (5) if it contains no deferred
> >>> objects (e.g. there's nothing to wait on).
> >>>
> >>> There is also (6) return PCollection. The
> >>> advantage of (2) is that one can migrate to (1) or (6) without
> >>> changing the public API, while giving something to wait on without
> >>> promising anything about its contents.
> >>>
> >>>
> >>> I would probably lean towards (4) for anything that would want to
> >>> return multiple signals/outputs (e.g. successful vs. failed writes)
> >>> and view (3) as being a "cheap" but more cumbersome for the user way
> >>> of writing (4). In both cases, more information can be added in a
> >>> forward-compatible way. Technically (4) could extend (3) if one wants
> >>> to migrate from (3) to (4) to provide a nicer API in the future. (As
> >>> an aside, it would be interesting if any of the schema work that lets
> >>> us get rid of tuple tags for elements (e.g. join operations) could let
> >>> us get rid of tuple tags for PCollectionTuples (e.g. letting a POJO
> >>> with PCollection members be as powerful as a PCollectionTuple).
> >>>
> >>> On Wed, Jun 26, 2019 at 2:23 PM Ismaël Mejía 
> wrote:
> >>> >
> >>> > Beam introduced in version 2.4.0 the Wait transform to delay
> >>> > processing of each window in a PCollection until signaled. This
> opened
> >>> > new interesting patterns for example writing to a database and when
> >>> > ‘fully’ done write to another database.
> >>> >
> >>> > To support this pattern an IO connector Write transform must return a
> >>> > type different from PDone to signal the processing of the next step.
> >>> > Some IOs have already started to implement this return type, but each
> >>> > returned type has different pros and cons so I wanted to open the
> >>> > discussion on this to see if we could somehow find a common pattern
> to
> >>> > suggest IO authors to follow (Note: It may be the case that there is
> >>> > not a pattern that fits certain use cases).
> >>> >
> >>> > So far the approaches in our code base are:
> >>> >
> >>> > 1. Write returns ‘PCollection’
> >>> >
> >>> > This is the simplest case but if subsequent transforms require more
> >>> > data that could have been produced during the write it gets ‘lost’.
> >>> > Used by JdbcIO and DynamoDBIO.
> >>> >
> >>> > 2. Write returns ‘PCollection’
> >>> >
> >>> > We can return whatever we want but the return type is uncertain for
> >>> > the user in case he wants to use information from it. This is less
> >>> > user friendly but has the maintenance advantage of not changing
> >>> > signatures if we want to change the return type in the future. Used
> by
> >>> > RabbitMQIO.
> >>> >
> >>> > 3. Write returns a `PCollectionTuple`
> >>> >
> >>> > It is like (2) but with the advantage of returning an untyped tuple
> of
> >>> > PCollections so we can return more things. Used by SnsIO.
> >>> >
> >>> > 4. Write returns ‘a class that implements POutput’
> >>> >
> >>> > This class wraps inside of the PCollections that were part of the
> >>> > write, e.g. SpannerWriteResult. This is useful because we can be
> >>> > interested on saving inside a PCollection of failed mutations apart
> of
> >>> > the ‘done’ signal. Used by BigQueryIO and SpannerIO. A generics case
> >>> > of this one is used by FileIO for Destinations via:
> >>> > ‘WriteFilesResult’.
> >>> >
> >>> > 5. Write returns ‘a class that implements POutput’ with specific data
> >>> 
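A plain-Java sketch of the shape option (4) describes, a result class wrapping
the write's outputs. In real Beam code this class would implement POutput and
hold PCollections; here plain Lists stand in so the shape is visible without
the SDK, and all names are hypothetical:

```java
import java.util.Collections;
import java.util.List;

// Hypothetical result type for option (4). A real version would
// implement POutput and wrap PCollections (e.g. failed mutations plus
// a "done" signal), as SpannerWriteResult does.
public class MyWriteResult {
  private final List<String> successfulInserts;
  private final List<String> failedInserts;

  public MyWriteResult(List<String> successfulInserts, List<String> failedInserts) {
    this.successfulInserts = successfulInserts;
    this.failedInserts = failedInserts;
  }

  // New outputs can be added later as extra getters without breaking
  // this signature -- the forward-compatibility argument made above.
  public List<String> getSuccessfulInserts() {
    return Collections.unmodifiableList(successfulInserts);
  }

  public List<String> getFailedInserts() {
    return Collections.unmodifiableList(failedInserts);
  }
}
```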

Re: Blogpost Beam Summit 2019

2019-06-25 Thread Reza Rokni
Thank you for putting this together!

On Wed, 26 Jun 2019 at 01:23, Ahmet Altay  wrote:

> Thank you for writing and sharing this. I enjoyed reading it :) I think it
> is worth sharing it as a tweet [1] as well.
>
> [1]  s.apache.org/beam-tweets
>
> On Tue, Jun 25, 2019 at 10:16 AM Valentyn Tymofieiev 
> wrote:
>
>> Hi Juta,
>>
>> Thanks for sharing! You can also consider sending it to user mailing list.
>>
>> Note that Datastore IO now supports Python 3:
>> https://lists.apache.org/thread.html/0a1fdb9b6b42b08a82eebf3b5b7898893ca236b5d3bb5c4751664034@%3Cuser.beam.apache.org%3E
>> .
>>
>> Thanks,
>> Valentyn
>>
>> On Tue, Jun 25, 2019 at 7:34 AM Juta Staes  wrote:
>>
>>>
>>> Hi all,
>>>
>>> First of all a big thank you to the organizers and the speakers of the
>>> Beam Summit from last week. I had a great time and learned a lot!
>>>
>>> I wrote a blogpost about the main topics that were covered during the
>>> summit:
>>> https://blog.ml6.eu/learnings-from-beam-summit-europe-2019-8d115900f1ee
>>> Any thoughts and feedback are welcome.
>>>
>>> Kind regards,
>>> Juta
>>> --
>>>
>>> [image: https://ml6.eu] 
>>>
>>> * Juta Staes*
>>> ML6 Gent
>>> 
>>>
>>>  DISCLAIMER 
>>> This email and any files transmitted with it are confidential and
>>> intended solely for the use of the individual or entity to whom they are
>>> addressed. If you have received this email in error please notify the
>>> system manager. This message contains confidential information and is
>>> intended only for the individual named. If you are not the named addressee
>>> you should not disseminate, distribute or copy this e-mail. Please notify
>>> the sender immediately by e-mail if you have received this e-mail by
>>> mistake and delete this e-mail from your system. If you are not the
>>> intended recipient you are notified that disclosing, copying, distributing
>>> or taking any action in reliance on the contents of this information is
>>> strictly prohibited.
>>>
>>

-- 

This email may be confidential and privileged. If you received this
communication by mistake, please don't forward it to anyone else, please
erase all copies and attachments, and please let me know that it has gone
to the wrong person.

The above terms reflect a potential business arrangement, are provided
solely as a basis for further discussion, and are not intended to be and do
not constitute a legally binding obligation. No legally binding obligations
will be created, implied, or inferred until an agreement in final form is
executed in writing by all parties involved.


Re: Looping timer blog

2019-06-25 Thread Reza Rokni
On Tue, 25 Jun 2019 at 18:12, Jan Lukavský  wrote:

> > The TTL check would be in the same Timer rather than a separate Timer.
> The max value processed in each OnTimer call would be stored in valuestate
> and used as the base to know how long it has been since the pipeline has
> seen an external value for that key.
>
> OK, that seems to work, if you store maximum timestamp in a value state
> (that is, basically you generate per-key watermark).
>
> > You can store it in ValueState rather than BagState, but yes you store
> that value in State ready for the next OnTimer() fire.
>
> In my understanding of the problem, I'd say that this approach seems a
> little suboptimal. Consider the following, when trying to generate the OHLC
> data (open, high, low, close, that is move last closing price to next
> window opening price)
>
>  - suppose we have three times T1 < T2 < T3 < T4, where times T2 and T4
> denote "end of windows" (closing times)
>
>  - first (in processing time), we receive value for time T3, we cache it
> in ValueState, we set timer for T4
>
>  - next, we receive value for T1 - but we cannot overwrite the value
> already written for T3, right? What to do then? And will we reset timer to
> time T3?
>
>  - basically, because we received *two* values, both of which are needed
> and no timer could have been fired in between, we need both values stored
> to know which value to emit when timer fires. And consider that on batch,
> the timer fires only after we see all data (without any ordering).
>
I assume you are referring to late data rather than out-of-order data (the
latter being taken care of by the in-memory sort). As discussed in the
talk, late data is a sharp edge; one solution for late data is to branch it
away before the GlobalWindow + State DoFn. This can then be output from the
pipeline into a sink with a marker to initiate a manual procedure for
correction. Essentially a manual redaction.
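The branching check itself is simple. As a plain-Java sketch of the "is this
element late?" predicate (in a real pipeline this would sit in a DoFn routing
late elements to a side output; the names here are illustrative):

```java
public class LateDataBranch {

  // An element is "late" if its timestamp falls behind the watermark
  // by more than the allowed lateness. A DoFn doing this check would
  // send late elements to a side output that feeds the sink used for
  // the manual correction procedure described above.
  public static boolean isLate(
      long elementTsMillis, long watermarkMillis, long allowedLatenessMillis) {
    return elementTsMillis < watermarkMillis - allowedLatenessMillis;
  }
}
```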

Or? Am I missing something?
>
> Jan
> On 6/25/19 6:00 AM, Reza Rokni wrote:
>
>
>
> On Fri, 21 Jun 2019 at 18:02, Jan Lukavský  wrote:
>
>> Hi Reza,
>>
>> great prezentation on the Beam Summit. I have had a few posts here in the
>> list during last few weeks, some of which might actually be related to both
>> looping timers and validity windows. But maybe you will be able to see a
>> different approach, than I do, so questions:
>>
>>  a) because of [1] timers might not be exactly ordered (the JIRA talks
>> about DirectRunner, but I suppose the issue is present on all runners that
>> use immutable bundles of size > 1, so might be related to Dataflow as
>> well). This might cause issues when you try to introduce TTL for looping
>> timers, because the TTL timer might get fired before regular looping timer,
>> which might cause incorrect results (state cleared before have been
>> flushed).
>>
> The TTL check would be in the same Timer rather than a separate Timer.
> The max value processed in each OnTimer call would be stored in valuestate
> and used as the base to know how long it has been since the pipeline has
> seen an external value for that key.
>
>>  b) because stateful pardo doesn't sort by timestamp, that implies, that
>> you have to store last values in BagState (as opposed to the blog, where
>> you just emit identity value of sum operation), right?
>>
> You can store it in ValueState rather than BagState, but yes you store
> that value in State ready for the next OnTimer() fire.
>
>>  c) because of how stateful pardo currently works on batch, does that
>> imply that all values (per key) would have to be stored in memory? would
>> that scale?
>>
> This is one of the sharp edges and the answer is ... it depends :-) I
> would recommend always using a  FixedWindow+Combiner before this step, this
> will compress the values into something much smaller. For example in case
> of building 'candles' this will compress down to low/hi/first/last values
> per FixedWindow length. If the window length is very small there may be no
> compression, but in most cases I have seen this is an ok compromise.
>
>> There is a discussion about problem a) in [2], but maybe there is some
>> different approach possible. For problem b) and c) there is a proposal [3].
>> When the input is sorted, it starts to work both in batch and with
>> ValueState, because the last value is the *valid* value.
>>
> There was also a discussion on dev@ around a sorted Map state, which
> would be very cool for this usecase.
>
>> This has even connection with the mentioned validity windows, as if you
>> sort by timestamp, the _last_ value is the _valid_ value, so is essentially

Re: Looping timer blog

2019-06-24 Thread Reza Rokni
On Fri, 21 Jun 2019 at 18:02, Jan Lukavský  wrote:

> Hi Reza,
>
> great prezentation on the Beam Summit. I have had a few posts here in the
> list during last few weeks, some of which might actually be related to both
> looping timers and validity windows. But maybe you will be able to see a
> different approach, than I do, so questions:
>
>  a) because of [1] timers might not be exactly ordered (the JIRA talks
> about DirectRunner, but I suppose the issue is present on all runners that
> use immutable bundles of size > 1, so might be related to Dataflow as
> well). This might cause issues when you try to introduce TTL for looping
> timers, because the TTL timer might get fired before regular looping timer,
> which might cause incorrect results (state cleared before have been
> flushed).
>
The TTL check would be in the same Timer rather than a separate Timer. The
max value processed in each OnTimer call would be stored in ValueState and
used as the base to know how long it has been since the pipeline has seen an
external value for that key.
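As a rough plain-Java sketch of that TTL check (the Beam Timer/ValueState
plumbing is omitted and the names are illustrative, not from the pattern's
actual code):

```java
public class TtlCheck {

  // Returns true when the looping timer should stop firing: no external
  // value has been seen for this key within the TTL window.
  // maxSeenMillis is the max element timestamp recorded in ValueState,
  // updated on each OnTimer fire; timerFireMillis is the timestamp of
  // the current fire.
  public static boolean expired(long maxSeenMillis, long timerFireMillis, long ttlMillis) {
    return timerFireMillis - maxSeenMillis >= ttlMillis;
  }
}
```

When expired() returns true, the OnTimer call would clear state and skip
re-setting the timer; otherwise it emits the default value and loops.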

>  b) because stateful pardo doesn't sort by timestamp, that implies, that
> you have to store last values in BagState (as opposed to the blog, where
> you just emit identity value of sum operation), right?
>
You can store it in ValueState rather than BagState, but yes you store that
value in State ready for the next OnTimer() fire.

>  c) because of how stateful pardo currently works on batch, does that
> imply that all values (per key) would have to be stored in memory? would
> that scale?
>
This is one of the sharp edges and the answer is ... it depends :-) I would
recommend always using a FixedWindow+Combiner before this step; this will
compress the values into something much smaller. For example, in the case of
building 'candles' this will compress down to low/hi/first/last values per
FixedWindow length. If the window length is very small there may be no
compression, but in most cases I have seen this is an ok compromise.
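The per-window compression for candles amounts to a small accumulator. A
plain-Java sketch of what a Combine accumulator over a FixedWindow might
track (illustrative only, not the actual extension code):

```java
public class Candle {
  double open, high, low, close;
  long openTs, closeTs;
  boolean empty = true;

  // Fold one (timestamp, price) tick into the accumulator. A Beam
  // CombineFn applied per FixedWindow would do exactly this,
  // compressing all ticks in the window down to four values.
  public void add(long tsMillis, double price) {
    if (empty) {
      open = high = low = close = price;
      openTs = closeTs = tsMillis;
      empty = false;
      return;
    }
    if (tsMillis <= openTs) { open = price; openTs = tsMillis; }
    if (tsMillis >= closeTs) { close = price; closeTs = tsMillis; }
    high = Math.max(high, price);
    low = Math.min(low, price);
  }
}
```

Note the timestamp checks for open/close: since input may arrive out of
order, first/last must be decided by element time, not arrival order.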

> There is a discussion about problem a) in [2], but maybe there is some
> different approach possible. For problem b) and c) there is a proposal [3].
> When the input is sorted, it starts to work both in batch and with
> ValueState, because the last value is the *valid* value.
>
There was also a discussion on dev@ around a sorted Map state, which would
be very cool for this usecase.

> This has even connection with the mentioned validity windows, as if you
> sort by timestamp, the _last_ value is the _valid_ value, so is essentially
> boils down to keep single value per key (and again, starts to work in both
> batch and stream).
>
one for Tyler :-)

> I even have a suspicion, that sorting by timestamp has close relation to
> retractions, because when you are using sorted streams, retractions
> actually became only diff between last emitted pane, and current pane. That
> might even help implement that in general, but I might be missing
> something. This just popped in my head today, as I was thinking why there
> was actually no (or little) need for retractions in Euphoria model (very
> similar to Beam, actually differs by the sorting thing :)), and why the
> need pops out so often in Beam.
>
Retractions will be possible with this, but it does mean that we would need
to keep old versions around, something built in would be very cool rather
than building it with this pattern.

> I'd be very happy to hear what you think about all of this.
>
> Cheers,
>
> Jan
>
> [1] https://issues.apache.org/jira/browse/BEAM-7520
>
> [2]
> https://lists.apache.org/thread.html/1a3a0dd9da682e159f78f131d335782fd92b047895001455ff659613@%3Cdev.beam.apache.org%3E
>
> [3]
> https://docs.google.com/document/d/1ObLVUFsf1NcG8ZuIZE4aVy2RYKx2FfyMhkZYWPnI9-c/edit?usp=sharing
> On 6/21/19 8:12 AM, Reza Rokni wrote:
>
> Great question, one thing that we did not cover in the blog and I think we
> should have is the use case where you would want to bootstrap the
> pipeline.
>
> One option would be on startup to have an extra bounded source that is
> read and flattened into the main pipeline, the source will need to contain
> values in  Timestamped format which would correspond to the first window
> that you would like to kickstart the process from.  Will see if I can try
> and find some time to code up an example and add that and the looping timer
> code into the Beam patterns.
>
> https://beam.apache.org/documentation/patterns/overview/
>
> Cheers
> Reza
>
>
>
>
>
> On Fri, 21 Jun 2019 at 07:59, Manu Zhang  wrote:
>
>> Indeed interesting pattern.
>>
>> One minor question. It seems the timer is triggered by the first element
>> so what if there is no data in the "first interval" ?
>>
>> Thanks

Re: Testing code in extensions against runner

2019-06-24 Thread Reza Rokni
So if I understood correctly:

Emulate the SQL precommit / postcommit extension and incorporate running
the test against different runners.

Would be snazzy indeed! A bit beyond my skill set I fear :-)

On Wed, 19 Jun 2019 at 10:34, Kenneth Knowles  wrote:

> Slight point here: @ValidatesRunner should only be for tests that the
> runner implements the core model.
>
> Also, outside of the SDK core, you don't need it. If you use TestPipeline
> it already picks up the config for what runner. So all you need is to use
> TestPipeline and add it to some suite of tests that is run. Perhaps the
> Java precommit/postcommit or a postcommit dedicated to that extension (SQL
> has one).
>
> There's also a pretty big difference in terms of *who* is initiating the
> test / *what* is under test. Your tests are probably best viewed as tests
> of the extension, not tests of the runner. So that is another reason they
> should not be @ValidatesRunner.
>
> It would be snazzy if it were a one-liner to add all this testing in the
> gradle config for extensions.
>
> Kenn
>
> On Tue, Jun 18, 2019 at 12:05 AM Reza Rokni  wrote:
>
>> Thanx!
>>
>> It would definitely be great to have the ability for folks adding utility
>> / extensions to be able to have them run against all runners.
>>
>> Cheers
>> Reza
>>
>> On Fri, 7 Jun 2019, 19:05 Lukasz Cwik,  wrote:
>>
>>> We have been currently been having every runner define and manage its
>>> own suite/tests so yes modifying flink_runner.gradle is currently the
>>> correct thing to do.
>>>
>>> There is a larger discussion about whether this is the right way since
>>> we would like to capture things like perf benchmarks and validates runner
>>> tests so we can add information to the website about how well a feature is
>>> supported by each runner automatically.
>>>
>>>
>>>
>>> On Thu, Jun 6, 2019 at 8:36 PM Reza Rokni  wrote:
>>>
>>>> Hi,
>>>>
>>>> I would like to validate some code that I am building under
>>>> extensions against different runners. It makes use of some caches in a DoFn
>>>> which are a little off the beaten path.
>>>>
>>>> I have added @ValidatesRunner to the class and by adding the right
>>>> values to the gradle file in flink_runner have got the tests to run.
>>>> However it does not feel right for me to change the flink_runner.gradle
>>>> file to achieve this, especially as this is all experimental and under
>>>> extensions.
>>>>
>>>> I could copy over all the bits needed from the gradle file over to my
>>>> extensions gradle, but then I would need to do that for all runners , which
>>>> also feels a bit heavy weight. Is there a way, or should there be a way of
>>>> having a task added to my gradle file which will do tests against all
>>>> runners for me?
>>>>
>>>> Cheers
>>>> Reza
>>>>
>>>>
>>>



Re: [ANNOUNCE] New committer: Mikhail Gryzykhin

2019-06-21 Thread Reza Rokni
Congratulations!

On Fri, 21 Jun 2019, 12:37 Robert Burke,  wrote:

> Congrats
>
> On Fri, Jun 21, 2019, 12:29 PM Thomas Weise  wrote:
>
>> Hi,
>>
>> Please join me and the rest of the Beam PMC in welcoming a new committer:
>> Mikhail Gryzykhin.
>>
>> Mikhail has been contributing to Beam and actively involved in the
>> community for over a year. He developed the community build dashboard [1]
>> and added substantial improvements to our build infrastructure. Mikhail's
>> work also covers metrics, contributor documentation, development process
>> improvements and other areas.
>>
>> In consideration of Mikhail's contributions, the Beam PMC trusts him with
>> the responsibilities of a Beam committer [2].
>>
>> Thank you, Mikhail, for your contributions and looking forward to many
>> more!
>>
>> Thomas, on behalf of the Apache Beam PMC
>>
>> [1] https://s.apache.org/beam-community-metrics
>> [2]
>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>
>>


Re: Looping timer blog

2019-06-21 Thread Reza Rokni
Great question! One thing that we did not cover in the blog, and I think we
should have, is the use case where you would want to bootstrap the pipeline.

One option would be on startup to have an extra bounded source that is read
and flattened into the main pipeline; the source will need to contain
values in Timestamped format corresponding to the first window
that you would like to kickstart the process from. Will see if I can find
some time to code up an example and add that and the looping timer
code into the Beam patterns.

https://beam.apache.org/documentation/patterns/overview/
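A plain-Java sketch of generating the seed timestamps for such a bootstrap
source (illustrative only; in a real pipeline each of these would become a
timestamped value read from the bounded source and flattened into the main
collection):

```java
import java.util.ArrayList;
import java.util.List;

public class Bootstrap {

  // Produce one seed timestamp per interval window between startMillis
  // (inclusive) and endMillis (exclusive). Emitting one element per
  // window boundary guarantees the looping timer has a first element
  // to start from in every window, even before real data arrives.
  public static List<Long> seedTimestamps(long startMillis, long endMillis, long intervalMillis) {
    List<Long> seeds = new ArrayList<>();
    for (long ts = startMillis; ts < endMillis; ts += intervalMillis) {
      seeds.add(ts);
    }
    return seeds;
  }
}
```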

Cheers
Reza





On Fri, 21 Jun 2019 at 07:59, Manu Zhang  wrote:

> Indeed interesting pattern.
>
> One minor question. It seems the timer is triggered by the first element
> so what if there is no data in the "first interval" ?
>
> Thanks for the write-up.
> Manu
>
> On Wed, Jun 19, 2019 at 12:15 PM Reza Rokni  wrote:
>
>> Hi folks,
>>
>> Just wanted to drop a note here on a new pattern that folks may find
>> interesting, called  Looping Timers. It allows for default values to be
>> created in interval windows in the absence of any external data coming into
>> the pipeline. The details are in this blog below:
>>
>> https://beam.apache.org/blog/2019/06/11/looping-timers.html
>>
>> Its main utility is when dealing with time series data. There are still
>> rough edges, like dealing with TTL and it would be great to hear
>> feedback on ways it can be improved.
>>
>> The next pattern to publish in this domain will assist with hold and
>> propagation of values from one interval window to the next, which coupled
>> to looping timers starts to solve some interesting problems.
>>
>> Cheers
>>
>> Reza
>>
>>
>>
>>
>



Looping timer blog

2019-06-18 Thread Reza Rokni
Hi folks,

Just wanted to drop a note here on a new pattern that folks may find
interesting, called  Looping Timers. It allows for default values to be
created in interval windows in the absence of any external data coming into
the pipeline. The details are in this blog below:

https://beam.apache.org/blog/2019/06/11/looping-timers.html

Its main utility is when dealing with time series data. There are still
rough edges, like dealing with TTL, and it would be great to hear
feedback on ways it can be improved.
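For readers who skip the blog link, the timer-alignment arithmetic behind
looping timers can be sketched in plain Java. This mirrors the idea, not
necessarily the blog's exact code:

```java
public class LoopingTimer {

  // Align the first timer fire to the end of the interval window that
  // contains the element's timestamp. In the pattern, the first element
  // seen for a key sets this timer.
  public static long firstFire(long elementTsMillis, long intervalMillis) {
    return (elementTsMillis / intervalMillis + 1) * intervalMillis;
  }

  // Each OnTimer call re-sets the timer one interval ahead, so a value
  // (real or default) is emitted every interval even with no input.
  public static long nextFire(long currentFireMillis, long intervalMillis) {
    return currentFireMillis + intervalMillis;
  }
}
```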

The next pattern to publish in this domain will assist with hold and
propagation of values from one interval window to the next, which coupled
to looping timers starts to solve some interesting problems.

Cheers

Reza





Re: Testing code in extensions against runner

2019-06-18 Thread Reza Rokni
Thanx!

It would definitely be great to have the ability for folks adding utility /
extensions to be able to have them run against all runners.

Cheers
Reza

On Fri, 7 Jun 2019, 19:05 Lukasz Cwik,  wrote:

> We have been currently been having every runner define and manage its own
> suite/tests so yes modifying flink_runner.gradle is currently the correct
> thing to do.
>
> There is a larger discussion about whether this is the right way since we
> would like to capture things like perf benchmarks and validates runner
> tests so we can add information to the website about how well a feature is
> supported by each runner automatically.
>
>
>
> On Thu, Jun 6, 2019 at 8:36 PM Reza Rokni  wrote:
>
>> Hi,
>>
>> I would like to validate some code that I am building under
>> extensions against different runners. It makes use of some caches in a DoFn
>> which are a little off the beaten path.
>>
>> I have added @ValidatesRunner to the class and by adding the right
>> values to the gradle file in flink_runner have got the tests to run.
>> However it does not feel right for me to change the flink_runner.gradle
>> file to achieve this, especially as this is all experimental and under
>> extensions.
>>
>> I could copy over all the bits needed from the gradle file over to my
>> extensions gradle, but then I would need to do that for all runners , which
>> also feels a bit heavy weight. Is there a way, or should there be a way of
>> having a task added to my gradle file which will do tests against all
>> runners for me?
>>
>> Cheers
>> Reza
>>
>>
>


Re: [INPUT PLEASE] Introduction to Committing Workshop @BeamSummit

2019-06-09 Thread Reza Rokni
For patterns, the "pipeline-patterns" label would be good to add as well:

https://beam.apache.org/documentation/dsls/sql/overview/




On Mon, 10 Jun 2019 at 11:22, Henry Suryawirawan 
wrote:

> Adding new Kata might also be an option for people to contribute.
>
>
>
>
> On Mon, Jun 10, 2019 at 7:00 AM Kenneth Knowles  wrote:
>
>> The roadmap on the web site might also be a good resource. New
>> contributors might be interested in one of the efforts there. Hopefully it
>> is up to date. Each effort there could have ideas for how to get involved.
>>
>> Kenn
>>
>> On Sat, Jun 8, 2019, 19:55 Austin Bennett 
>> wrote:
>>
>>> Hi Dev Community,
>>>
>>> We are preparing for the Beam Summit, coming in June in Berlin.
>>>
>>> In JIRA, I see labels:
>>> * beginner
>>> * easyfix
>>> * documentation
>>> * newbie
>>> * starter
>>>
>>> Anything else to point new (potential) committers to?
>>>
>>> Would also really love for people to upload to JIRA and make suggestions
>>> for things that those incredibly new to Beam/contributing might be able to
>>> do in a short period of time -- I would happily aggregate and make specific
>>> suggestions in the workshop.  We hope to get them familiar with tooling,
>>> our ways, and potentially actually submit a PR in a couple hours.
>>>
>>> Cheers,
>>> Austin
>>>
>>>
>>>



Re: @RequireTimeSortedInput design draft

2019-06-09 Thread Reza Rokni
Hi,

Interesting reading on issue 143 :-) My example is more specific in its
scope, but the general pattern will have uses with most timeseries I suspect.

The specific Jira is:

https://issues.apache.org/jira/browse/BEAM-7386?filter=-2

The signature is currently of the form:

public static class BiTemporalJoin
extends PTransform,
PCollection>>

if you are interested and have the bandwidth it would be great to have you as
a reviewer for the PR. Also I hope to get time to contribute more around
timeseries utilities and it would be great to have collaborators! I have not
looked into the detail of Euphoria (looks interesting!) but it should be
reasonably straightforward to make use of the class in other places.

Cheers

Reza


On Fri, 7 Jun 2019 at 14:50, Jan Lukavský  wrote:

> Hi Reza, interesting suggestions, thanks.
>
> When you mentioned join, I recalled an older issue (which apparently was
> not yet transfered to Beam's JIRA)  [1]. Is this anyhow related to what you
> are implementing? Would you like to make your implementation accessible via
> Euphoria DSL [2]?
>
>  Jan
>
> [1] https://github.com/seznam/euphoria/issues/143
>
> [2]
> https://github.com/apache/beam/blob/master/sdks/java/extensions/euphoria/src/main/java/org/apache/beam/sdk/extensions/euphoria/core/client/operator/Join.java
> On 6/7/19 7:06 AM, Reza Rokni wrote:
>
> Hi Jan,
>
> I have been working on a timeseries extension which makes use of many of
> these techniques for joining two temporal streams, it's almost ready for
> the PR, will ping it here when it is as it might be useful for you. In
> general, I borrowed a lot of techniques from CoGroupBy code.
>
> *1) need to figure out how to get Coder of input PCollection of stateful
> ParDo inside StatefulDoFnRunner*
> My join takes in a  , in the outer transform I use things like
> leftCollection.getCoder()).getValueCoder(); Then when creating the Join
> transform I can defer the StateSpec object creation until the constructor
> is called.
>
> *2) there are performance considerations, that can be solved probably only
> by Sorted Map State [2]*
> Sorted Map is going to be awesome; until then, the only option is to create
> a cache in the DoFn to make it more efficient. For the cache to work you
> need to key on Window + key and do things like clear the
> cache @StartBundle. Better to wait for Sorted Map if this is not time
> critical.
>
> *3) additional work is needed for allowedLateness to work correctly (and
> there are at least two ways how to solve this), see the design doc [3]*
> Yup, in my case I can support this by not GC'ing the right side of the join
> for now, but that is a compromise.
>
> *4) more tests (for batch and validatesRunner) are needed*
> I just posted a question on the best way to make use of
> the @ValidatesRunner annotation on this list; sounds like it might be useful
> to you as well :-)
>
>
> On Thu, 6 Jun 2019 at 23:03, Jan Lukavský  wrote:
>
>> Hi,
>>
>> I have written a PoC implementation of this in [1] and I'd like to
>> discuss some implementation details. First of all, I'd appreciate any
>> feedback about this. There are some known issues:
>>
>>   1) need to figure out how to get Coder of input PCollection of
>> stateful ParDo inside StatefulDoFnRunner
>>
>>   2) there are performance considerations, that can be solved probably
>> only by Sorted Map State [2]
>>
>>   3) additional work is needed for allowedLateness to work correctly
>> (and there are at least two ways how to solve this), see the design doc
>> [3]
>>
>>   4) more tests (for batch and validatesRunner) are needed
>>
>> I have come across a few bugs in DirectRunner, which I tried to solve:
>>
>>   a) timers seem to be broken in stateful pardo with side inputs
>>
>>   b) timers need to be sorted by timestamp, otherwise state might be
>> cleared before it gets chance to be flushed
>>
>>
>> Thanks for feedback,
>>
>>   Jan
>>
>>
>> [1] https://github.com/apache/beam/pull/8774
>>
>> [2]
>>
>> http://mail-archives.apache.org/mod_mbox/beam-dev/201905.mbox/%3ccalstk6+ldemtjmnuysn3vcufywjkhmgv1isfbdmxthoqh91...@mail.gmail.com%3e
>>
>> [3]
>>
>> https://docs.google.com/document/d/1ObLVUFsf1NcG8ZuIZE4aVy2RYKx2FfyMhkZYWPnI9-c/
>>
>>
>> On 5/23/19 4:40 PM, Robert Bradshaw wrote:
>> > Thanks for writing this up.
>> >
>> > I think the justification for adding this to the model needs to be
>> > that it is useful (you have this covered, though some examples would
>> > be nice) and that it's something that can't easily be done by users
>> > themselves (specifically, though it can be (relatively) cheaply done
>> > in streaming and batch, it's done in very different ways, and also
>> > that it's hard to do via composition).

Re: [ANNOUNCEMENT] Common Pipeline Patterns - new section in the documentation + contributions welcome

2019-06-07 Thread Reza Rokni
+1 on the pattern Tim!

Please raise a Jira with the label pipeline-patterns, details are here:
https://beam.apache.org/documentation/patterns/overview/#contributing-a-pattern



On Sat, 8 Jun 2019 at 05:04, Tim Robertson 
wrote:

> This is great. Thanks Pablo and all
>
> I've seen several folk struggle with writing avro to dynamic locations
> which I think might be a good addition. If you agree I'll offer a PR unless
> someone gets there first - I have an example here:
>
> https://github.com/gbif/pipelines/blob/master/pipelines/export-gbif-hbase/src/main/java/org/gbif/pipelines/hbase/beam/ExportHBase.java#L81
>
>
> On Fri, Jun 7, 2019 at 10:52 PM Pablo Estrada  wrote:
>
>> Hello everyone,
>> A group of community members has been working on gathering and providing
>> common pipeline patterns for pipelines in Beam. These are examples on how
>> to perform certain operations, and useful ways of using Beam in your
>> pipelines. Some of them relate to processing of files, use of side inputs,
>> state/timers, etc. Check them out[1].
>>
>> These initial patterns have been chosen based on evidence gathered from
>> StackOverflow, and from talking to users of Beam.
>>
>> It would be great if this section could grow, and be useful to many Beam
>> users. For that reason, we invite anyone to share patterns, and pipeline
>> examples that they have used in the past. If you are interested in
>> contributing, please submit a pull request, or get in touch with Cyrus
>> Maden, Reza Rokni, Melissa Pashniak or myself.
>>
>> Thanks!
>> Best
>> -P.
>>
>> [1] https://beam.apache.org/documentation/patterns/overview/
>>
>

-- 

This email may be confidential and privileged. If you received this
communication by mistake, please don't forward it to anyone else, please
erase all copies and attachments, and please let me know that it has gone
to the wrong person.

The above terms reflect a potential business arrangement, are provided
solely as a basis for further discussion, and are not intended to be and do
not constitute a legally binding obligation. No legally binding obligations
will be created, implied, or inferred until an agreement in final form is
executed in writing by all parties involved.


Re: @RequireTimeSortedInput design draft

2019-06-06 Thread Reza Rokni
Hi Jan,

I have been working on a timeseries extension which makes use of many of
these techniques for joining two temporal streams. It's almost ready for
the PR; I'll ping it here when it is, as it might be useful for you. In
general, I borrowed a lot of techniques from the CoGroupBy code.

*1) need to figure out how to get Coder of input PCollection of stateful
ParDo inside StatefulDoFnRunner*
My join takes in a KV input; in the outer transform I use things like
((KvCoder) leftCollection.getCoder()).getValueCoder(). Then, when creating
the Join transform, I can defer the StateSpec object creation until the
constructor is called.

*2) there are performance considerations, that can be solved probably only
by Sorted Map State [2]*
Sorted Map is going to be awesome; until then, the only option is to create
a cache in the DoFn to make it more efficient. For the cache to work you
need to key on Window + key and do things like clear the
cache @StartBundle. Better to wait for Sorted Map if this is not time
critical.
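
The cache idea above can be sketched without any Beam dependency. This is
only an illustration of the Window + key keying and the bundle-scoped
clearing; in a real DoFn the clear would be driven by a @StartBundle method,
the window/key types would come from Beam, and all names below are made up
for the sketch:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class WindowKeyCache<V> {
  // Entries keyed on "window|key" so the same key in two different
  // windows never collides.
  private final Map<String, V> cache = new HashMap<>();

  public V getOrLoad(String window, String key, Function<String, V> loader) {
    return cache.computeIfAbsent(window + "|" + key, loader);
  }

  // In a real Beam DoFn this would be called from a @StartBundle method,
  // so cached state never leaks across bundles.
  public void clearAtBundleStart() {
    cache.clear();
  }
}
```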

*3) additional work is needed for allowedLateness to work correctly (and
there are at least two ways how to solve this), see the design doc [3]*
Yup, in my case I can support this by not GC'ing the right side of the join
for now, but that is a compromise.

*4) more tests (for batch and validatesRunner) are needed*
I just posted a question on the best way to make use of the @ValidatesRunner
annotation on this list; sounds like it might be useful to you as well :-)


On Thu, 6 Jun 2019 at 23:03, Jan Lukavský  wrote:

> Hi,
>
> I have written a PoC implementation of this in [1] and I'd like to
> discuss some implementation details. First of all, I'd appreciate any
> feedback about this. There are some known issues:
>
>   1) need to figure out how to get Coder of input PCollection of
> stateful ParDo inside StatefulDoFnRunner
>
>   2) there are performance considerations, that can be solved probably
> only by Sorted Map State [2]
>
>   3) additional work is needed for allowedLateness to work correctly
> (and there are at least two ways how to solve this), see the design doc [3]
>
>   4) more tests (for batch and validatesRunner) are needed
>
> I have come across a few bugs in DirectRunner, which I tried to solve:
>
>   a) timers seem to be broken in stateful pardo with side inputs
>
>   b) timers need to be sorted by timestamp, otherwise state might be
> cleared before it gets chance to be flushed
>
>
> Thanks for feedback,
>
>   Jan
>
>
> [1] https://github.com/apache/beam/pull/8774
>
> [2]
>
> http://mail-archives.apache.org/mod_mbox/beam-dev/201905.mbox/%3ccalstk6+ldemtjmnuysn3vcufywjkhmgv1isfbdmxthoqh91...@mail.gmail.com%3e
>
> [3]
>
> https://docs.google.com/document/d/1ObLVUFsf1NcG8ZuIZE4aVy2RYKx2FfyMhkZYWPnI9-c/
>
>
> On 5/23/19 4:40 PM, Robert Bradshaw wrote:
> > Thanks for writing this up.
> >
> > I think the justification for adding this to the model needs to be
> > that it is useful (you have this covered, though some examples would
> > be nice) and that it's something that can't easily be done by users
> > themselves (specifically, though it can be (relatively) cheaply done
> > in streaming and batch, it's done in very different ways, and also
> > that it's hard to do via composition).
> >
> > On Thu, May 23, 2019 at 4:10 PM Jan Lukavský  wrote:
> >> Hi,
> >>
> >> I have written a very brief draft of how it might be possible to
> >> implement @RequireTimeSortedInput discussed in [1]. I see the document
> >> [2] a starting point for a discussion. There are several open questions,
> >> which I believe can be resolved by this great community. :-)
> >>
> >> Jan
> >>
> >> [1]
> http://mail-archives.apache.org/mod_mbox/beam-dev/201905.mbox/browser
> >>
> >> [2]
> >>
> https://docs.google.com/document/d/1ObLVUFsf1NcG8ZuIZE4aVy2RYKx2FfyMhkZYWPnI9-c/
> >>
>




Testing code in extensions against runner

2019-06-06 Thread Reza Rokni
Hi,

I would like to validate some code that I am building under
extensions against different runners. It makes use of some caches in a DoFn
which are a little off the beaten path.

I have added @ValidatesRunner to the class and, by adding the right values
to the gradle file in flink_runner, have got the tests to run. However, it
does not feel right for me to change the flink_runner.gradle file to
achieve this, especially as this is all experimental and under extensions.

I could copy all the bits needed from that gradle file over to my
extensions gradle file, but then I would need to do that for all runners,
which also feels a bit heavyweight. Is there a way, or should there be a
way, of adding a task to my gradle file which will run the tests against
all runners for me?

Cheers
Reza



Re: Timer support in Flink

2019-06-06 Thread Reza Rokni
I changed the font size and the wording a little; it's in this PR:

https://github.com/apache/beam/pull/8773

I started to mess around with moving it to a better position etc., but
then I quickly remembered why no one lets me near web pages, CSS, etc. :-)
If anyone else wants to enhance that PR, please go for it!

Cheers

Reza

On Tue, 4 Jun 2019 at 15:42, Robert Bradshaw  wrote:

> One issue with the fully expanded version is that it's so large it's
> hard to read.
>
> I think it would be useful to make the ~ entries (at least) clickable
> or with hover tool tips. It would be nice to be able to expand columns
> individually as well.
>
> On Tue, Jun 4, 2019 at 7:20 AM Melissa Pashniak 
> wrote:
> >
> >
> > Yeah, people's eyes likely jump to the big "What is being computed?"
> header first and skip the small font "expand details" (that's what my eyes
> did anyway!) Even just moving the expand/collapse to be AFTER the header of
> the table (or down to the next line)  and making the font bigger might help
> a lot. And maybe making the text more explicit: "Click to expand for more
> details".
> >
> > I'm traveling right now so can't take an in-depth look, but this might
> be doable by changing the order of things in [1] and the font size in [2].
> I'll add this info to the JIRA also.
> >
> > [1]
> https://github.com/apache/beam/blame/master/website/src/_includes/capability-matrix.md#L18
> > [2]
> https://github.com/apache/beam/blob/master/website/src/_sass/capability-matrix.scss#L130
> >
> >
> > On Mon, Jun 3, 2019 at 2:15 AM Maximilian Michels 
> wrote:
> >>
> >> Good point. I think I discovered the detailed view when I made changes
> >> to the source code. Classic tunnel-vision problem :)
> >>
> >> On 30.05.19 12:57, Reza Rokni wrote:
> >> > :-)
> >> >
> >> > https://issues.apache.org/jira/browse/BEAM-7456
> >> >
> >> > On Thu, 30 May 2019 at 18:41, Alex Van Boxel  >> > <mailto:a...@vanboxel.be>> wrote:
> >> >
> >> > Oh... you can expand the matrix. Never saw that, this could indeed
> >> > be better. So it isn't you.
> >> >
> >> >   _/
> >> > _/ Alex Van Boxel
> >> >
> >> >
> >> > On Thu, May 30, 2019 at 12:24 PM Reza Rokni  >> >     <mailto:r...@google.com>> wrote:
> >> >
> >> > PS, until it was just pointed out to me by Max, I had missed the
> >> > (expand details) clickable link in the capability matrix.
> >> >
> >> > Probably just me, but do others think it's also easy to miss? If
> >> > yes I will raise a Jira for it
> >> >
> >> > On Wed, 29 May 2019 at 19:52, Reza Rokni  >> > <mailto:r...@google.com>> wrote:
> >> >
> >> > Thanx Max!
> >> >
> >> > Reza
> >> >
> >> > On Wed, 29 May 2019, 16:38 Maximilian Michels,
> >> > mailto:m...@apache.org>> wrote:
> >> >
> >> > Hi Reza,
> >> >
> >> > The detailed view of the capability matrix states: "The
> >> > Flink Runner supports timers in non-merging windows."
> >> >
> >> > That is still the case. Other than that, timers should
> >> > be working fine.
> >> >
> >> >  > It makes very heavy use of Event.Time timers and has
> >> >  > to do some manual DoFn cache work to get around some
> >> >  > O(heavy) issues.
> >> >
> >> > If you are running on Flink 1.5, timer deletion suffers from
> >> > O(n) complexity which has been fixed in newer versions.
> >> >
> >> > Cheers,
> >> > Max
> >> >
> >> > On 29.05.19 03:27, Reza Rokni wrote:
> >> >  > Hi Flink experts,
> >> >  >
> >> >  > I am getting ready to push a PR around a utility
> >> > class for timeseries join
> >> >  >
> >> >  > left.timestamp match to closest right.timestamp where
> >> >  > right.timestamp <= left.timestamp.

Re: [DISCUSS] Cookbooks for users with knowledge in other frameworks

2019-06-01 Thread Reza Rokni
For Layer 1, what about working through this link as a starting point:
https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations
?
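
As a concrete example of what a Layer 1 entry could pin down: Spark's
groupByKey (from the guide linked above) and Beam's GroupByKey share the same
core semantic of collecting all values per key. A plain-Java statement of
that semantic, illustrative only and not the Spark or Beam API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupByKeySemantics {
  // The result both Spark's groupByKey and Beam's GroupByKey compute,
  // expressed over an in-memory list of key/value pairs.
  public static Map<String, List<Integer>> groupByKey(
      List<Map.Entry<String, Integer>> pairs) {
    Map<String, List<Integer>> out = new HashMap<>();
    for (Map.Entry<String, Integer> pair : pairs) {
      out.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
          .add(pair.getValue());
    }
    return out;
  }
}
```

A Layer 1 page would then show the idiomatic Spark and Beam code side by
side for this same semantic.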

On Sat, 1 Jun 2019 at 09:21, Ahmet Altay  wrote:

> Thank you Reza. That separation makes sense to me.
>
> On Wed, May 29, 2019 at 6:26 PM Reza Rokni  wrote:
>
>> +1
>>
>> I think there will be at least two layers of this;
>>
>> Layer 1 - Using primitives : I do join, GBK, Aggregation... with system x
>> this way, what is the canonical equivalent in Beam.
>> Layer 2 - Patterns : I read and join Unbounded and Bounded Data in system
>> x this way, what is the canonical equivalent in Beam.
>>
>> I suspect that, as a first pass, Layer 1 is reasonably well-bounded work.
>> There would need to be agreement on the "canonical" version of how to do
>> something in Beam, as this could be seen as opinionated; there are often a
>> multitude of ways of doing x.
>>
>
> Once we identify a set of layer 1 items, we could crowd source the
> canonical implementations. I believe we can use our usual code review
> process to settle on a version that is agreeable. (Examples have the same
> issue, they are probably opinionated today based on the author but it works
> out.)
>
>
>>
>>
>> On Thu, 30 May 2019 at 08:56, Ahmet Altay  wrote:
>>
>>> Hi all,
>>>
>>> Inspired by the user asking about a Spark feature in Beam [1] in the
>>> release thread, I searched the user@ list and noticed a few instances
>>> of people asking questions like "I can do X in Spark, how can I do that
>>> in Beam?" Would it make sense to add documentation explaining how certain
>>> tasks can be accomplished in Beam, with side-by-side examples of doing
>>> the same task in Beam/Spark etc. It could help with on-boarding because it
>>> will be easier for people to leverage their existing knowledge. It could
>>> also help other frameworks as well, because it will serve as a Rosetta
>>> stone with two translations.
>>>
>>> Questions I have are:
>>> - Would such a thing be helpful?
>>> - Is it feasible? Would a few pages' worth of examples cover enough
>>> use cases?
>>>
>>> Thank you!
>>> Ahmet
>>>
>>> [1]
>>> https://lists.apache.org/thread.html/b73a54aa1e6e9933628f177b04a8f907c26cac854745fa081c478eff@%3Cdev.beam.apache.org%3E
>>>
>>
>>
>>
>



Re: Timer support in Flink

2019-05-30 Thread Reza Rokni
:-)

https://issues.apache.org/jira/browse/BEAM-7456

On Thu, 30 May 2019 at 18:41, Alex Van Boxel  wrote:

> Oh... you can expand the matrix. Never saw that, this could indeed be
> better. So it isn't you.
>
>  _/
> _/ Alex Van Boxel
>
>
> On Thu, May 30, 2019 at 12:24 PM Reza Rokni  wrote:
>
>> PS, until it was just pointed out to me by Max, I had missed the (expand
>> details) clickable link in the capability matrix.
>>
>> Probably just me, but do others think it's also easy to miss? If yes I
>> will raise a Jira for it
>>
>> On Wed, 29 May 2019 at 19:52, Reza Rokni  wrote:
>>
>>> Thanx Max!
>>>
>>> Reza
>>>
>>> On Wed, 29 May 2019, 16:38 Maximilian Michels,  wrote:
>>>
>>>> Hi Reza,
>>>>
>>>> The detailed view of the capability matrix states: "The Flink Runner
>>>> supports timers in non-merging windows."
>>>>
>>>> That is still the case. Other than that, timers should be working fine.
>>>>
>>>> > It makes very heavy use of Event.Time timers and has to do some
>>>> manual DoFn cache work to get around some O(heavy) issues.
>>>>
>>>> If you are running on Flink 1.5, timer deletion suffers from O(n)
>>>> complexity which has been fixed in newer versions.
>>>>
>>>> Cheers,
>>>> Max
>>>>
>>>> On 29.05.19 03:27, Reza Rokni wrote:
>>>> > Hi Flink experts,
>>>> >
>>>> > I am getting ready to push a PR around a utility class for
>>>> timeseries join
>>>> >
>>>> > left.timestamp match to closest right.timestamp where right.timestamp
>>>> <=
>>>> > left.timestamp.
>>>> >
>>>> > It makes very heavy use of Event.Time timers and has to do some
>>>> manual
>>>> > DoFn cache work to get around some O(heavy) issues. Wanted to test
>>>> > things against Flink: In the capability matrix we have "~" for Timer
>>>> > support in Flink:
>>>> >
>>>> > https://beam.apache.org/documentation/runners/capability-matrix/
>>>> >
>>>> > Is that page outdated, if not what are the areas that still need to
>>>> be
>>>> > addressed please?
>>>> >
>>>> > Cheers
>>>> >
>>>> > Reza
>>>> >
>>>> >
>>>> > --
>>>> >
>>>> > This email may be confidential and privileged. If you received this
>>>> > communication by mistake, please don't forward it to anyone else,
>>>> please
>>>> > erase all copies and attachments, and please let me know that it has
>>>> > gone to the wrong person.
>>>> >
>>>> > The above terms reflect a potential business arrangement, are
>>>> provided
>>>> > solely as a basis for further discussion, and are not intended to be
>>>> and
>>>> > do not constitute a legally binding obligation. No legally binding
>>>> > obligations will be created, implied, or inferred until an agreement
>>>> in
>>>> > final form is executed in writing by all parties involved.
>>>> >
>>>>
>>>
>>
>>
>



Re: Shuffling on apache beam

2019-05-30 Thread Reza Rokni
Hi,

Would you mind sharing your latency requirements? For example is it < 1 sec
at XX percentile?

With regards to stateful DoFn: with a few exceptions, it is supported:
https://beam.apache.org/documentation/runners/capability-matrix/#cap-full-what

Cheers

Reza




On Thu, 30 May 2019 at 18:08, pasquale.bon...@gmail.com <
pasquale.bon...@gmail.com> wrote:

> This was my first option but I'm using google dataflow as runner and it's
> not clear if it supports stateful DoFn.
> However, my problem is latency; I've been trying different solutions but it
> seems difficult to bring latency under 1s when consuming messages (150/s)
> from PubSub with beam/dataflow.
> Is some benchmark or example available to understand whether we can
> effectively achieve low latency, or should we look at different solutions?
>
>
>
> On 2019/05/29 17:12:37, Pablo Estrada  wrote:
> > If you add a stateful DoFn to your pipeline, you'll force Beam to shuffle
> > data to their corresponding worker per key. I am not sure what is the
> > latency cost of doing this (as the messages still need to be shuffled).
> But
> > it may help you accomplish this without adding windowing+triggering.
> >
> > -P.
> >
> > On Wed, May 29, 2019 at 5:16 AM pasquale.bon...@gmail.com <
> > pasquale.bon...@gmail.com> wrote:
> >
> > > Hi Reza,
> > > with GlobalWindow plus triggering I was able to reduce hotspot issues,
> > > gaining satisfying performance for the BigTable update. Unfortunately,
> > > latency when getting messages from PubSub remains around 1.5s, which is
> > > too much considering our NFR.
> > >
> > > This is the code I use to get the messages:
> > > PCollectionTuple rawTransactions = p
> > >     .apply("GetMessages",
> > >         PubsubIO.readMessagesWithAttributes()
> > >             .withIdAttribute(TRANSACTION_MESSAGE_ID_FIELD_NAME)
> > >             .withTimestampAttribute(TRANSACTION_MESSAGE_TIMESTAMP_FIELD_NAME)
> > >             .fromTopic(topic))
> > >     .apply(Window.configure()
> > >         .triggering(Repeatedly.forever(
> > >             AfterWatermark.pastEndOfWindow()
> > >                 .withEarlyFirings(AfterProcessingTime
> > >                     .pastFirstElementInPane()
> > >                     .plusDelayOf(Duration.millis(1)))
> > >                 // Fire on any late data
> > >                 .withLateFirings(AfterPane.elementCountAtLeast(1))))
> > >         .discardingFiredPanes())
> > >
> > > Messages are produced with a different dataflow:
> > > Pipeline p = Pipeline.create(options);
> > > p.apply("ReadFile",
> > >         TextIO.read()
> > >             .from(options.getInputLocation() + "/*.csv")
> > >             .watchForNewFiles(
> > >                 // Check for new files every 600 ms
> > >                 Duration.millis(600),
> > >                 // Never stop checking for new files
> > >                 Watch.Growth.never()))
> > >     .apply("create message",
> > >         ParDo.of(new DoFn<String, PubsubMessage>() {
> > >           @ProcessElement
> > >           public void processElement(ProcessContext context) {
> > >             String line = context.element();
> > >             String payload = convertRow(line);
> > >             long now = LocalDateTime.now().atZone(ZoneId.systemDefault())
> > >                 .toInstant().toEpochMilli();
> > >             context.output(new PubsubMessage(
> > >                 payload.getBytes(),
> > >                 ImmutableMap.of(
> > >                     TRANSACTION_MESSAGE_ID_FIELD_NAME, payload.split(",")[6],
> > >                     TRANSACTION_MESSAGE_TIMESTAMP_FIELD_NAME,
> > >                     Long.toString(now))));
> > >           }
> > >         }))
> > >     .apply("publish message", PubsubIO.writeMessages().to(topic));
> > >
> > > I'm uploading a file containing 100 rows every 600 ms.
> > >
> > > I found different threads on stackoverflow around this latency issue,
> > > but none has a solution.
> > >
> > >

Re: Timer support in Flink

2019-05-30 Thread Reza Rokni
PS, until it was just pointed out to me by Max, I had missed the (expand
details) clickable link in the capability matrix.

Probably just me, but do others think it's also easy to miss? If yes, I will
raise a Jira for it.

On Wed, 29 May 2019 at 19:52, Reza Rokni  wrote:

> Thanx Max!
>
> Reza
>
> On Wed, 29 May 2019, 16:38 Maximilian Michels,  wrote:
>
>> Hi Reza,
>>
>> The detailed view of the capability matrix states: "The Flink Runner
>> supports timers in non-merging windows."
>>
>> That is still the case. Other than that, timers should be working fine.
>>
>> > It makes very heavy use of Event.Time timers and has to do some manual
>> DoFn cache work to get around some O(heavy) issues.
>>
>> If you are running on Flink 1.5, timer deletion suffers from O(n)
>> complexity which has been fixed in newer versions.
>>
>> Cheers,
>> Max
>>
>> On 29.05.19 03:27, Reza Rokni wrote:
>> > Hi Flink experts,
>> >
>> > I am getting ready to push a PR around a utility class for
>> timeseries join
>> >
>> > left.timestamp match to closest right.timestamp where right.timestamp
>> <=
>> > left.timestamp.
>> >
>> > It makes very heavy use of Event.Time timers and has to do some manual
>> > DoFn cache work to get around some O(heavy) issues. Wanted to test
>> > things against Flink: In the capability matrix we have "~" for Timer
>> > support in Flink:
>> >
>> > https://beam.apache.org/documentation/runners/capability-matrix/
>> >
>> > Is that page outdated, if not what are the areas that still need to be
>> > addressed please?
>> >
>> > Cheers
>> >
>> > Reza
>> >
>> >
>> >
>>
>



Re: [DISCUSS] Cookbooks for users with knowledge in other frameworks

2019-05-29 Thread Reza Rokni
+1

I think there will be at least two layers of this;

Layer 1 - Using primitives : I do join, GBK, Aggregation... with system x
this way, what is the canonical equivalent in Beam.
Layer 2 - Patterns : I read and join Unbounded and Bounded Data in system x
this way, what is the canonical equivalent in Beam.

I suspect that, as a first pass, Layer 1 is reasonably well-bounded work.
There would need to be agreement on the "canonical" version of how to do
something in Beam, as this could be seen as opinionated; there are often a
multitude of ways of doing x.


On Thu, 30 May 2019 at 08:56, Ahmet Altay  wrote:

> Hi all,
>
> Inspired by the user asking about a Spark feature in Beam [1] in the
> release thread, I searched the user@ list and noticed a few instances of
> people asking questions like "I can do X in Spark, how can I do that in
> Beam?" Would it make sense to add documentation explaining how certain
> tasks can be accomplished in Beam, with side-by-side examples of doing
> the same task in Beam/Spark etc. It could help with on-boarding because it
> will be easier for people to leverage their existing knowledge. It could
> also help other frameworks as well, because it will serve as a Rosetta
> stone with two translations.
>
> Questions I have are:
> - Would such a thing be helpful?
> - Is it feasible? Would a few pages' worth of examples cover enough use
> cases?
>
> Thank you!
> Ahmet
>
> [1]
> https://lists.apache.org/thread.html/b73a54aa1e6e9933628f177b04a8f907c26cac854745fa081c478eff@%3Cdev.beam.apache.org%3E
>




Re: Hazelcast Jet Runner

2019-05-29 Thread Reza Rokni
Hi,

Over 800 usages under java; it might be worth doing a few PRs...

Also, I suggest we use a very light review process: in the first round, go
for low-hanging fruit; if anyone does a -1 against a change, then we leave
that for round two.

Thoughts?

Cheers

Reza

On Wed, 29 May 2019 at 12:05, Kenneth Knowles  wrote:

>
>
> On Mon, May 27, 2019 at 4:05 PM Reza Rokni  wrote:
>
>> "Many APIs that have been in place for years and are used by most Beam
>> users are still marked Experimental."
>>
>> Should there be a formal process in place to start 'graduating' features
>> out of @Experimental? Perhaps even target an upcoming release with a PR to
>> remove the annotation from well-established APIs?
>>
>
> Good idea. I think a PR like this would be an opportunity to discuss
> whether the feature is non-experimental. Probably many of them are ready.
> It would help to address Ismael's very good point that this new practice
> could make users think the old Experimental stuff is not experimental.
> Maybe it is true that it is not really still Experimental.
>
> Kenn
>
>
>
>> On Tue, 28 May 2019 at 06:44, Reuven Lax  wrote:
>>
>>> We generally use Experimental for two different things, which leads to
>>> confusion.
>>>   1. Features that work stably, but where we think we might still make
>>> some changes to the API.
>>>   2. New features that we think might not yet be stable.
>>>
>>> This dual usage leads to a lot of confusion IMO. The fact that we tend
>>> to forget to remove the @Experimental tag also makes it somewhat useless.
>>> Many APIs that have been in place for years and are used by most Beam users
>>> are still marked Experimental.
>>>
>>> Reuven
>>>
>>> On Mon, May 27, 2019 at 2:16 PM Ismaël Mejía  wrote:
>>>
>>>> > Personally, I think that it is good that moving from experimental to
>>>> non-experimental is a breaking change in the dependency - one has
>>>> backwards-incompatible changes and the other does not. If artifacts had
>>>> separate versioning we could use 0.x for this.
>>>>
>>>> In theory it seems so, but in practice it is an annoyance to an end
>>>> user that already took the ‘risk’ of using an experimental feature.
>>>> Awareness is probably not the most important reason to break existing
>>>> code (even if it could be easily fixed). The alternative of doing this
>>>> with version numbers at least seems less impacting but can be
>>>> confusing.
>>>>
>>>> > But biggest motivation for me are these:
>>>> >
>>>> >  - using experimental features should be opt-in
>>>> >  - should be impossible to use an experimental feature without
>>>> knowing it (so "opt-in" to a normal-looking feature is not enough)
>>>> > - developers of an experimental feature should be motivated to
>>>> "graduate" it
>>>>
>>>> The fundamental problem of this approach is inconsistency with our
>>>> present/past. So far we have ‘Experimental’ features everywhere. So
>>>> suddenly becoming opt-in leaves us in an inconsistent state. For example,
>>>> all IOs are marked internally as Experimental but not at the level of
>>>> directories/artifacts. Adding this suffix to a new IO, apart from adding
>>>> fear of use for the end users, may also give the false impression that
>>>> the older ones not explicitly marked are not experimental.
>>>>
>>>> What will be the state for example in the case of runner modules that
>>>> contain both mature and well tested runners like old Flink and Spark
>>>> runners vs the more experimental new translations for Portability,
>>>> again more confusion.
>>>>
>>>> > FWIW I don't think "experimental" should be viewed as a bad thing. It
>>>> just means you are able to make backwards-incompatible changes, and that
>>>> users should be aware that they will need to adjust APIs (probably only a
>>>> little) with new releases. Most software is not very good until it has been
>>>> around for a long time, and in my experience the problem is missing the
>>>> mark on abstractions, so backwards compatibility *must* be broken to
>>>> achieve quality. Freezing it early dooms it to never achieving high
>>>> quality. I know of projects where the users explicitly requested that the
>>>> developers not freeze the API but instead prioritize speed and quality.

Re: Timer support in Flink

2019-05-29 Thread Reza Rokni
Thanx Max!

Reza

On Wed, 29 May 2019, 16:38 Maximilian Michels,  wrote:

> Hi Reza,
>
> The detailed view of the capability matrix states: "The Flink Runner
> supports timers in non-merging windows."
>
> That is still the case. Other than that, timers should be working fine.
>
> > It makes very heavy use of Event.Time timers and has to do some manual
> DoFn cache work to get around some O(heavy) issues.
>
> If you are running on Flink 1.5, timer deletion suffers from O(n)
> complexity which has been fixed in newer versions.
>
> Cheers,
> Max
>
> On 29.05.19 03:27, Reza Rokni wrote:
> > Hi Flink experts,
> >
> > I am getting ready to push a PR around a utility class for
> timeseries join
> >
> > left.timestamp matched to the closest right.timestamp where right.timestamp <=
> > left.timestamp.
> >
> > It makes very heavy use of Event.Time timers and has to do some manual
> > DoFn cache work to get around some O(heavy) issues. Wanted to test
> > things against Flink: In the capability matrix we have "~" for Timer
> > support in Flink:
> >
> > https://beam.apache.org/documentation/runners/capability-matrix/
> >
> > Is that page outdated, if not what are the areas that still need to be
> > addressed please?
> >
> > Cheers
> >
> > Reza
> >
> >
> >
>


Timer support in Flink

2019-05-28 Thread Reza Rokni
Hi Flink experts,

I am getting ready to push a PR around a utility class for timeseries join

left.timestamp matched to the closest right.timestamp where right.timestamp <=
left.timestamp.
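The matching rule described above can be sketched outside Beam with a sorted lookup. A minimal plain-Python illustration of the semantics (this is only an illustration of the join rule, not the utility class being proposed):

```python
import bisect

def closest_before(left_ts, right_ts):
    """For each left timestamp, find the latest right timestamp <= it (or None)."""
    rs = sorted(right_ts)
    out = {}
    for lt in left_ts:
        i = bisect.bisect_right(rs, lt)  # first index where rs[i] > lt
        out[lt] = rs[i - 1] if i > 0 else None
    return out

print(closest_before([5, 10, 2], [1, 4, 9]))  # → {5: 4, 10: 9, 2: 1}
```

In a streaming pipeline the right-hand side cannot simply be sorted up front, which is why the utility needs timers and state.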

It makes very heavy use of Event.Time timers and has to do some manual DoFn
cache work to get around some O(heavy) issues. Wanted to test things
against Flink: In the capability matrix we have "~" for Timer support in
Flink:

https://beam.apache.org/documentation/runners/capability-matrix/

Is that page outdated, if not what are the areas that still need to be
addressed please?

Cheers

Reza




Re: Hazelcast Jet Runner

2019-05-27 Thread Reza Rokni
"Many APIs that have been in place for years and are used by most Beam
users are still marked Experimental."

Should there be a formal process in place to start 'graduating' features
out of @Experimental? Perhaps even target an upcoming release with a PR to
remove the annotation from well-established APIs?

On Tue, 28 May 2019 at 06:44, Reuven Lax  wrote:

> We generally use Experimental for two different things, which leads to
> confusion.
>   1. Features that work stably, but where we think we might still make
> some changes to the API.
>   2. New features that we think might not yet be stable.
>
> This dual usage leads to a lot of confusion IMO. The fact that we tend to
> forget to remove the @Experimental tag also makes it somewhat useless. Many
> APIs that have been in place for years and are used by most Beam users are
> still marked Experimental.
>
> Reuven
>
> On Mon, May 27, 2019 at 2:16 PM Ismaël Mejía  wrote:
>
>> > Personally, I think that it is good that moving from experimental to
>> non-experimental is a breaking change in the dependency - one has
>> backwards-incompatible changes and the other does not. If artifacts had
>> separate versioning we could use 0.x for this.
>>
>> In theory it seems so, but in practice it is an annoyance to an end
>> user that already took the ‘risk’ of using an experimental feature.
>> Awareness is probably not the most important reason to break existing
>> code (even if it could be easily fixed). The alternative of doing this
>> with version numbers at least seems less impacting but can be
>> confusing.
>>
>> > But biggest motivation for me are these:
>> >
>> >  - using experimental features should be opt-in
>> >  - should be impossible to use an experimental feature without knowing
>> it (so "opt-in" to a normal-looking feature is not enough)
>> > - developers of an experimental feature should be motivated to
>> "graduate" it
>>
>> The fundamental problem of this approach is inconsistency with our
>> present/past. So far we have ‘Experimental’ features everywhere. So
>> suddenly becoming opt-in leaves us in an inconsistent state. For example,
>> all IOs are marked internally as Experimental but not at the level of
>> directories/artifacts. Adding this suffix to a new IO, apart from adding
>> fear of use for the end users, may also give the false impression that
>> the older ones not explicitly marked are not experimental.
>>
>> What will be the state for example in the case of runner modules that
>> contain both mature and well tested runners like old Flink and Spark
>> runners vs the more experimental new translations for Portability,
>> again more confusion.
>>
>> > FWIW I don't think "experimental" should be viewed as a bad thing. It
>> just means you are able to make backwards-incompatible changes, and that
>> users should be aware that they will need to adjust APIs (probably only a
>> little) with new releases. Most software is not very good until it has been
>> around for a long time, and in my experience the problem is missing the
>> mark on abstractions, so backwards compatibility *must* be broken to
>> achieve quality. Freezing it early dooms it to never achieving high
>> quality. I know of projects where the users explicitly requested that the
>> developers not freeze the API but instead prioritize speed and quality.
>>
>> I agree 100% on the arguments, but let’s think in the reverse terms,
>> highlighting lack of maturity can play against the intended goal of
>> use and adoption even if for a noble reason. It is basic priming 101
>> [1].
>>
>> > Maybe the word is just too negative-sounding? Alternatives might be
>> "unstable" or "incubating".
>>
>> Yes! “experimental” should not be viewed as a bad thing unless you are
>> a company that has less resources and is trying to protect its
>> investment so in that case they may doubt to use it. In this case
>> probably incubating is a better term because it has less of the
>> ‘tentative’ dimension associated with Experimental.
>>
>> > Now, for the Jet runner, most runners sit on a branch for a while, not
>> being released at all, and move to master as their "graduation". I think
>> releasing under an "experimental" name is an improvement, making it
>> available to users to try out. But we probably should have discussed before
>> doing something different than all the other runners.
>>
>> There is something I don’t get in the case of Jet runner. From the
>> discussion in this thread it seems it has everything required to not
>> be ‘experimental’. It passes ValidatesRunner and can even run Nexmark;
>> that’s more than some runners already merged in master, so I still
>> don’t get why we want to give it a different connotation.
>>
>> [1] https://en.wikipedia.org/wiki/Priming_(psychology)
>>
>> On Sun, May 26, 2019 at 4:43 AM Kenneth Knowles  wrote:
>> >
>> > Personally, I think that it is good that moving from experimental to
>> non-experimental is a breaking change in the dependency - one has
>> 

Re: Shuffling on apache beam

2019-05-24 Thread Reza Rokni
PS You can also make use of the GlobalWindow with a stateful DoFn.
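For intuition, the key affinity being asked about in this thread comes down to deterministic hash partitioning, which is what a GroupByKey or stateful DoFn relies on under the hood. A small standalone sketch of the idea (illustrative only; this is not a Beam API, and runners choose their own partitioning):

```python
import hashlib

def worker_for_key(key: str, num_workers: int) -> int:
    # A stable hash (unlike Python's builtin hash(), which is salted per
    # process) guarantees the same key always maps to the same worker index.
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers

# All records for "user-42" land on the same worker index on every run.
print(worker_for_key("user-42", 8) == worker_for_key("user-42", 8))  # → True
```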

On Fri, 24 May 2019 at 15:13, Reza Rokni  wrote:

> Hi,
>
> Have you explored the use of triggers with your use case?
>
>
> https://beam.apache.org/releases/javadoc/2.12.0/org/apache/beam/sdk/transforms/windowing/Trigger.html
>
> Cheers
>
> Reza
>
> On Fri, 24 May 2019 at 14:14, pasquale.bon...@gmail.com <
> pasquale.bon...@gmail.com> wrote:
>
>> Hi Reuven,
>> I would like to know if it is possible to guarantee that records are
>> processed by the same thread/task based on a key, as probably happens in a
>> combine/stateful operation, without adding the delay of a window.
>> This could increase the efficiency of caching and reduce some racing
>> conditions when writing data.
>> I understand that workers are not part of the programming model, so I would
>> like to know if it's possible to achieve this behaviour while reducing the
>> delay of windowing to a minimum. We don't need any combine or state; we just
>> want all records with a given key to be sent to the same thread.
>>
>> Thanks
>>
>>
>> On 2019/05/24 03:20:13, Reuven Lax  wrote:
>> > Can you explain what you mean by worker? While every runner has workers
>> of
>> > course, workers are not part of the programming model.
>> >
>> > On Thu, May 23, 2019 at 8:13 PM pasquale.bon...@gmail.com <
>> > pasquale.bon...@gmail.com> wrote:
>> >
>> > > Hi all,
>> > > I would like to know if Apache Beam has a functionality similar to
>> > > fieldsGrouping in Storm that allows to send records to a specific
>> > > task/worker based on a key.
>> > > I know that we can achieve that with a combine/groupByKey operation but
>> > > that implies to add a windowing in our pipeline that we don't want.
>> > > I have also tried using a stateful transformation.
>> > > I think that also in that case we should use a windowing, but I see that
>> > > a job with a stateful ParDo operation can be submitted on Google Dataflow
>> > > with windowing. I don't know if this depends on a lack of support for
>> > > stateful processing on Dataflow, and if I can effectively achieve my goal
>> > > with this solution.
>> > >
>> > >
>> > > Thanks in advance for your help
>> > >
>> > >
>> >
>>
>
>
>




Re: Shuffling on apache beam

2019-05-24 Thread Reza Rokni
Hi,

Have you explored the use of triggers with your use case?

https://beam.apache.org/releases/javadoc/2.12.0/org/apache/beam/sdk/transforms/windowing/Trigger.html

Cheers

Reza

On Fri, 24 May 2019 at 14:14, pasquale.bon...@gmail.com <
pasquale.bon...@gmail.com> wrote:

> Hi Reuven,
> I would like to know if it is possible to guarantee that records are processed
> by the same thread/task based on a key, as probably happens in a
> combine/stateful operation, without adding the delay of a window.
> This could increase the efficiency of caching and reduce some racing conditions
> when writing data.
> I understand that workers are not part of the programming model, so I would
> like to know if it's possible to achieve this behaviour while reducing the
> delay of windowing to a minimum. We don't need any combine or state; we just
> want all records with a given key to be sent to the same thread.
>
> Thanks
>
>
> On 2019/05/24 03:20:13, Reuven Lax  wrote:
> > Can you explain what you mean by worker? While every runner has workers
> of
> > course, workers are not part of the programming model.
> >
> > On Thu, May 23, 2019 at 8:13 PM pasquale.bon...@gmail.com <
> > pasquale.bon...@gmail.com> wrote:
> >
> > > Hi all,
> > > I would like to know if Apache Beam has a functionality similar to
> > > fieldsGrouping in Storm that allows to send records to a specific
> > > task/worker based on a key.
> > > I know that we can achieve that with a combine/groupByKey operation but
> > > that implies to add a windowing in our pipeline that we don't want.
> > > I have also tried using a stateful transformation.
> > > I think that also in that case we should use a windowing, but I see that
> > > a job with a stateful ParDo operation can be submitted on Google Dataflow
> > > with windowing. I don't know if this depends on a lack of support for
> > > stateful processing on Dataflow, and if I can effectively achieve my goal
> > > with this solution.
> > >
> > >
> > > Thanks in advance for your help
> > >
> > >
> >
>




Re: DISCUSS: Sorted MapState API

2019-05-23 Thread Reza Rokni
+100. As someone who has just spent a lot of time coding all the "GC,
caching of BagState fetches, etc.", this would make life a lot easier!

It's also generally valuable for a lot of timeseries work.

On Fri, 24 May 2019 at 04:59, Lukasz Cwik  wrote:

>
>
> On Thu, May 23, 2019 at 1:53 PM Ahmet Altay  wrote:
>
>>
>>
>> On Thu, May 23, 2019 at 1:38 PM Lukasz Cwik  wrote:
>>
>>>
>>>
>>> On Thu, May 23, 2019 at 11:37 AM Rui Wang  wrote:
>>>
 A few obvious problems with this code:
>   1. Removing the elements already processed from the bag requires
> clearing and rewriting the entire bag. This is O(n^2) in the number of
> input trades.
>
 why isn't it O(2 * n) to clear and rewrite the trade state?


> public interface SortedMultimapState extends State {
>   // Add a value to the map.
>   void put(K key, V value);
>   // Get all values for a given key.
>   ReadableState> get(K key);
>  // Return all entries in the map.
>   ReadableState>> allEntries();
>   // Return all entries in the map with keys <= limit. returned
> elements are sorted by the key.
>   ReadableState>> entriesUntil(K limit);
>
  // Remove all values with the given key;
>   void remove(K key);
>  // Remove all entries in the map with keys <= limit.
>   void removeUntil(K limit);
>
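A plain-Python sketch of the proposed semantics may help make the discussion concrete. The method names below mirror the draft interface above, but this is purely illustrative and not the Beam API; a real runner implementation would back this with sorted state storage rather than an in-memory dict:

```python
from collections import defaultdict

class SortedMultimapSketch:
    """In-memory model of the proposed SortedMultimapState semantics."""
    def __init__(self):
        self._data = defaultdict(list)

    def put(self, key, value):
        self._data[key].append(value)

    def get(self, key):
        return list(self._data.get(key, []))

    def entries_until(self, limit):
        # All (key, value) pairs with key <= limit, sorted by key.
        return [(k, v) for k in sorted(self._data) if k <= limit
                for v in self._data[k]]

    def remove_until(self, limit):
        for k in [k for k in self._data if k <= limit]:
            del self._data[k]

s = SortedMultimapSketch()
s.put(3, "c"); s.put(1, "a"); s.put(2, "b")
print(s.entries_until(2))  # → [(1, 'a'), (2, 'b')]
s.remove_until(1)
print(s.entries_until(3))  # → [(2, 'b'), (3, 'c')]
```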
 Will removeUntilExcl(K limit) also be useful? It would remove all entries
 in the map with keys < limit.


> Runners will sort based on the encoded value of the key. In order to
> make this easier for users, I propose that we introduce a new tag on 
> Coders
> *PreservesOrder*. A Coder that contains this tag guarantees that the
> encoded value preserves the same ordering as the base Java type.
>

 Could you clarify what is meant by "encoded value preserves the same ordering
 as the base Java type"?

>>>
>>> Lets say A and B represent two different instances of the same Java type
>>> like a double, then A < B (using the languages comparison operator) iff
>>> encode(A) < encode(B) (note the encoded versions are compared
>>> lexicographically)
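A concrete way to see this property: big-endian fixed-width encodings of unsigned integers compare lexicographically in the same order as the underlying values, while little-endian encodings do not. A small illustrative sketch (not a Beam coder; signed integers and floats would need an extra transform, such as flipping the sign bit, before the property holds):

```python
import struct

vals = [0, 1, 255, 256, 70000]

# Big-endian fixed-width: byte-wise (lexicographic) order agrees with
# numeric order, so a coder built this way could carry the PreservesOrder tag.
big = [struct.pack(">I", v) for v in vals]
assert sorted(big) == big

# Little-endian does NOT preserve order: 256 encodes as b'\x00\x01\x00\x00',
# which sorts before 1's encoding b'\x01\x00\x00\x00'.
little = [struct.pack("<I", v) for v in vals]
assert sorted(little) != little
```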
>>>
>>
>> Since coders are shared across SDKs, do we expect the A < B iff e(A) < e(B)
>> property to hold for all languages we support? What happens if A and B sort
>> differently in different languages?
>>
>
> I assume it would need to mean that the property holds across all
> languages or the language has to use a specific wrapper type which honors
> the sorted order within that language. It is likely that the runner is
> doing the sorting and the runner may or may not be written in the same
> language the SDK is executing in.
>
>
>>
>>>
>>>

 -Rui

>>>



Re: Better naming for runner specific options

2019-05-21 Thread Reza Rokni
Hi,

Coming back to this, is the general consensus that this can be addressed
via https://issues.apache.org/jira/browse/BEAM-6531 in Beam 3.0?

Cheers
Reza

On Tue, 7 May 2019 at 23:15, Valentyn Tymofieiev 
wrote:

> I think using RunnerOptions was an idea at some point, but in Python, we
> ended up parsing options from the runner api without populating
> RunnerOptions, and  RunnerOptions was eventually removed [1].
>
> If we decide to rename options, a path forward may be to have runners
> recognize both old and new names until Beam 3.0, but update codebase,
> examples and documentation to use new names.
>
> [1]
> https://github.com/apache/beam/commit/f3623e8ba2257f7659ccb312dc2574f862ef41b5#diff-525d5d65bedd7ea5e6fce6e4cd57e153L815
>
> *From:*Ahmet Altay 
> *Date:*Mon, May 6, 2019, 6:01 PM
> *To:*dev
>
> There is RunnerOptions already. Its options are populated by querying the
>> job service. Any portable runner is able to provide a list of options that
>> is runner specific through that mechanism.
>>
>> *From: *Reza Rokni 
>> *Date: *Mon, May 6, 2019 at 2:57 PM
>> *To: * 
>>
>> So the options here would be moved to runner options?
>>>
>>> https://beam.apache.org/releases/pydoc/2.12.0/_modules/apache_beam/options/pipeline_options.html#WorkerOptions
>>>
>>> In Java they are in DataflowPipelineWorkerPoolOptions and of course we
>>> have FlinkPipelineOptions etc...
>>>
>>> *From: *Chamikara Jayalath 
>>> *Date: *Tue, 7 May 2019 at 05:29
>>> *To: *dev
>>>
>>>
>>>> On Mon, May 6, 2019 at 2:13 PM Lukasz Cwik  wrote:
>>>>
>>>>> There were also discussions[1] in the past about scoping
>>>>> PipelineOptions to specific PTransforms. Would scoping PipelineOptions to
>>>>> PTransforms make this a more general solution?
>>>>>
>>>>> 1:
>>>>> https://lists.apache.org/thread.html/05f849d39788cb0af840cb9e86ca631586783947eb4e5a1774b647d1@%3Cdev.beam.apache.org%3E
>>>>>
>>>>
>>>> Is this just for pipeline construction time or also for runtime ?
>>>> Trying to scope options for transforms at runtime might complicate things
>>>> in the presence of optimizations such as fusion.
>>>>
>>>>
>>>>>
>>>>> On Mon, May 6, 2019 at 12:02 PM Ankur Goenka 
>>>>> wrote:
>>>>>
>>>>>> Having namespaces for options makes sense.
>>>>>> I think that, along with that, a help command to print all the options
>>>>>> given the runner name will be useful.
>>>>>> As for the scope of namespacing, I think that assigning a logical
>>>>>> namespace gives more flexibility around how and where we declare options.
>>>>>> It also makes future refactoring possible.
>>>>>>
>>>>>>
>>>>>> On Mon, May 6, 2019 at 7:50 AM Maximilian Michels 
>>>>>> wrote:
>>>>>>
>>>>>>> Good points. As already mentioned there is no namespacing between
>>>>>>> the
>>>>>>> different pipeline option classes. In particular, there is no
>>>>>>> separate
>>>>>>> namespace for system and user options which is most concerning.
>>>>>>>
>>>>>>> I'm in favor of an optional namespace using the class name of the
>>>>>>> defining pipeline option class. That way we would at least be able
>>>>>>> to
>>>>>>> resolve duplicate option names. For example, if there were an "optionX"
>>>>>>> in both class A and class B, we could use "A#optionX" to refer to the
>>>>>>> one from class A.
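The resolution rule being proposed could be sketched as follows. All names here (the `resolve` helper, the `declared` mapping) are made up for illustration; Beam's actual PipelineOptions machinery works via interfaces and proxies:

```python
def resolve(option, declared):
    """declared maps an options-class name -> set of option names it defines.
    Accepts 'ClassName#option', or a bare 'option' when globally unique."""
    if "#" in option:
        cls, name = option.split("#", 1)
        if name in declared.get(cls, set()):
            return cls, name
        raise KeyError(option)
    owners = [cls for cls, names in sorted(declared.items()) if option in names]
    if len(owners) == 1:
        return owners[0], option
    raise KeyError("ambiguous or unknown option: %r (%s)" % (option, owners))

declared = {"A": {"optionX", "optionY"}, "B": {"optionX"}}
print(resolve("A#optionX", declared))  # → ('A', 'optionX')
print(resolve("optionY", declared))    # → ('A', 'optionY')
```

A bare `optionX` would be rejected as ambiguous, which is exactly the backwards-compatible behaviour suggested above: unqualified names keep working as long as they are globally unique.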
>>>>>>>
>>>>>>
>>>> I think this solves the original problem. Runner specific options will
>>>> have unique names that includes the runner (in options class). I guess to
>>>> be complete we also have to include the package (module for Python) ?
>>>> If an option is globally unique, users should be able to specify it
>>>> without qualifying (at least for backwards compatibility).
>>>>
>>>>
>>>>>
>>>>>>> -Max
>>>>>>>
>>>>>>> On 04.05.19 02:23, Reza Rokni wrote:
>>>>>>> > Great point Lukasz, worker machine could be relevant to multiple
>>>>>>

Re: SqlTransform Metadata

2019-05-21 Thread Reza Rokni
Hi,

Coming back to this, do we have enough of a consensus to say that in
principle this is a good idea? If yes, I will raise a Jira for this.

Cheers

Reza

On Thu, 16 May 2019 at 02:58, Robert Bradshaw  wrote:

> On Wed, May 15, 2019 at 8:51 PM Kenneth Knowles  wrote:
> >
> > On Wed, May 15, 2019 at 3:05 AM Robert Bradshaw 
> wrote:
> >>
> >> Isn't there an API for concisely computing new fields from old ones?
> >> Perhaps these expressions could contain references to metadata values
> >> such as the timestamp. Otherwise,
> >
> > Even so, being able to refer to the timestamp implies something about
> its presence in a namespace, shared with other user-decided names.
>
> I was thinking that functions may live in a different namespace than
> fields.
>
> > And it may be nice for users to use that API within the composite
> SqlTransform. I think there are a lot of options.
> >
> >> Rather than withMetadata reifying the value as a nested field, with
> >> the timestamp, window, etc. at the top level, one could let it take a
> >> field name argument that attaches all the metadata as an extra
> >> (struct-like) field. This would be like attachX, but without having to
> >> have a separate method for every X.
> >
> > If you leave the input field names at the top level, then any "attach"
> style API requires choosing a name that doesn't conflict with input field
> names. You can't write a generic transform that works with all inputs. I
> think it is much simpler to move the input field all into a nested
> row/struct. Putting all the metadata in a second nested row/struct is just
> as good as top-level, perhaps. But moving the input into the struct/row is
> important.
>
> Very good point about writing generic transforms. It does mean a lot
> of editing if one decides one wants to access the metadata field(s)
> after-the-fact. (I also don't think we need to put the metadata in a
> nested struct if the value is.)
>
> >> It seems restrictive to only consider this a a special mode for
> >> SqlTransform rather than a more generic operation. (For SQL, my first
> >> instinct would be to just make this a special function like
> >> element_timestamp(), but there is some ambiguity there when there are
> >> multiple tables in the expression.)
> >
> > I would propose it as both: we already have some Reify transforms, and
> you could make a general operation that does this small data preparation
> easily. I think the proposal is just to add a convenience build method on
> SqlTransform to include the underlying functionality as part of the
> composite, which we really already have.
> >
> > I don't think we should extend SQL with built-in functions for
> element_timestamp() and things like that, because SQL already has TIMESTAMP
> columns and it is very natural to use SQL on unbounded relations where the
> timestamp is just part of the data.
>
> That's why I was suggesting a single element_metadata() rather than
> exploding each one out.
>
> Do you have a pointer to what the TIMESTAMP columns are? (I'm assuming
> this is a special field, but distinct from the metadata timestamp?)
>
> >> On Wed, May 15, 2019 at 5:03 AM Reza Rokni  wrote:
> >> >
> >> > Hi,
> >> >
> >> > One use case would be when dealing with the windowing functions for
> example:
> >> >
> >> > SELECT f_int, COUNT(*) , TUMBLE_START(f_timestamp, INTERVAL '1' HOUR)
> tumble_start
> >> >   FROM PCOLLECTION
> >> >   GROUP BY
> >> > f_int,
> >> > TUMBLE(f_timestamp, INTERVAL '1' HOUR)
> >> >
> >> > For an element which is using Metadata to inform the EventTime of the
> element, rather than data within the element itself, I would need to create
> a new schema which adds the timestamp as a field. I think other examples
> which may be interesting are getting the value of a row with the max/min
> timestamp. None of this would be difficult, but it does feel a little on the
> verbose side and also makes the pipeline a little harder to read.
> >> >
> >> > Cheers
> >> > Reza
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > From: Kenneth Knowles 
> >> > Date: Wed, 15 May 2019 at 01:15
> >> > To: dev
> >> >
> >> >> We have support for nested rows so this should be easy. The
> .withMetadata would reify the struct, moving from Row to WindowedValue
> if I understand it...
> >> >>
> >> >> SqlTransform.query("SELECT f

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-21 Thread Reza Rokni
Hi Jan,

It's a super interesting use case you have, and it has a lot of similarity with
the complexity that comes up when dealing with time-series problems.

I wonder if it would be interesting to see if the pattern generalises
enough to make some utility classes abstracting the complexity from the
user.

Cheers

Reza

On Tue, 21 May 2019, 20:13 Jan Lukavský,  wrote:

> Hi Reza,
>
> I think it probably would provide enough compression. But it would
> introduce complications and latency for the streaming case. Although I see
> your point, I was trying to figure out if the Beam model should support
> these use cases more "natively".
>
> Cheers,
>
>  Jan
> On 5/21/19 11:03 AM, Reza Rokni wrote:
>
> In a lot of cases the initial combiner can dramatically reduce the amount
> of data in this last phase making it tractable for a lot of use cases.
>
>  I assume in your example the first phase would not provide enough
> compression?
>
> Cheers
>
> Reza
>
> On Tue, 21 May 2019, 16:47 Jan Lukavský,  wrote:
>
>> Hi Reza, thanks for reaction, comments inline.
>> On 5/21/19 1:02 AM, Reza Rokni wrote:
>>
>> Hi,
>>
>> If I have understood the use case correctly, your output is an ordered
>> counter of state changes.
>>
>> One approach which might be worth exploring is outlined below; I haven't
>> had a chance to test it, so it could be missing pieces or be plain old wrong
>> (I will try and come up with a test example later on to try it out).
>>
>> 1 - Window into a small enough Duration such that the number of
>> elements in a window per key can be read into a memory structure for sorting.
>>
>> 2 - GBK
>> 3 - In a DoFn do the ordering and output Timestamped elements that
>> contain the state changes for just that window and the value of the last
>> element  {timestamp-00:00:00: (one: 1, zero: 0, lastElement : 0)}. This
>> will cause memory pressure so your step 1 is important.
>>
>> This is just an optimization, right?
>>
>> 4- Window these outputs into the Global Window with a Stateful DoFn
>>
>> Because you finally have to do the stateful ParDo in Global window, you
>> will end up with the same problem - the first three steps just might give
>> you some extra time. But if you have enough data (long enough history, of
>> very frequent changes, or both), then you will run into the same issues as
>> without the optimization here. The BagState simply would not be able to
>> hold all the data in batch case.
>>
>> Jan
>>
>> 5 - Add elements to a BagState in the stateful DoFn
>> 6 - In the Global Window set an EventTimer to fire at time boundaries
>> that match the time window that you need. Note Timers do not have a read
>> function for the time that they are set. (Here is one way to set
>> metadata to emulate a read function
>> <https://stackoverflow.com/questions/55912522/setting-a-timer-to-the-minimum-timestamp-seen/55912542#55912542>)
>> Again this can cause memory pressure.
>> 7 - At each OnTimer,
>> 7a-  read and sort the elements in the BagState,
>> 7b - True up the state changes with the cross-window state changes from
>> the list.
>> 7c - Store the last accumulator into a different State
>>
>> Sorry that was off the top of my head so could be missing things. For
>> example LateData would need to be dealt with outside of this flow...
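Steps 1–3 of the outline can be sketched in plain Python, with the windowing and GBK collapsed into a dict keyed by window start; the cross-window true-up of steps 4–7 and late data are omitted, and all names here are illustrative rather than Beam API:

```python
from collections import defaultdict

def window_summaries(elements, window_size):
    """Bucket (timestamp, value) pairs into fixed windows, sort each window
    in memory, and emit per-window state-change counts plus the last value."""
    windows = defaultdict(list)          # steps 1-2: window + group by window
    for ts, value in elements:
        windows[ts - ts % window_size].append((ts, value))
    out = {}
    for start, elems in sorted(windows.items()):
        elems.sort()                     # step 3: in-memory ordering
        ones = zeros = 0
        prev = None
        for _, v in elems:
            if prev is not None and v != prev:
                ones += int(v == 1)      # transitions into state 1
                zeros += int(v == 0)     # transitions into state 0
            prev = v
        out[start] = {"one": ones, "zero": zeros, "last": prev}
    return out

print(window_summaries([(3, 1), (1, 0), (7, 0), (5, 1)], window_size=4))
# → {0: {'one': 1, 'zero': 0, 'last': 1}, 4: {'one': 0, 'zero': 1, 'last': 0}}
```

Carrying the `last` value forward is what lets the later stateful step true up transitions that straddle a window boundary.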
>>
>> Cheers
>> Reza
>>
>> On Tue, 21 May 2019 at 07:00, Kenneth Knowles  wrote:
>>
>>> Thanks for the nice small example of a calculation that depends on
>>> order. You are right that many state machines have this property. I agree
>>> w/ you and Luke that it is convenient for batch processing to sort by event
>>> timestamp before running a stateful ParDo. In streaming you could also
>>> implement "sort by event timestamp" by buffering until you know all earlier
>>> data will be dropped - a slack buffer up to allowed lateness.
>>>
>>> I do not think that it is OK to sort in batch and not in streaming. Many
>>> state machines diverge very rapidly when things are out of order. So each
>>> runner, if it sees the "@OrderByTimestamp" annotation (or whatever), needs
>>> to deliver sorted data (by some mix of buffering and dropping), or to
>>> reject the pipeline as unsupported.
>>>
>>> And I also want to say that this is not the default case - many uses of
>>> state & timers in ParDo yield different results at the element level, but
>>> the results are equivalent in the big picture. Such as the example of
>>> "

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-21 Thread Reza Rokni
In a lot of cases the initial combiner can dramatically reduce the amount
of data in this last phase making it tractable for a lot of use cases.

 I assume in your example the first phase would not provide enough
compression?

Cheers

Reza

On Tue, 21 May 2019, 16:47 Jan Lukavský,  wrote:

> Hi Reza, thanks for reaction, comments inline.
> On 5/21/19 1:02 AM, Reza Rokni wrote:
>
> Hi,
>
> If I have understood the use case correctly, your output is an ordered
> counter of state changes.
>
> One approach which might be worth exploring is outlined below, haven't
> had a chance to test it so could be missing pieces or be plain old wrong (
> will try and come up with a test example later on to try it out).
>
> 1 - Window into a small enough Duration such that the number of
> elements in a window per key can be read into memory structure for sorting.
>
> 2 - GBK
> 3 - In a DoFn do the ordering and output Timestamped elements that
> contain the state changes for just that window and the value of the last
> element  {timestamp-00:00:00: (one: 1, zero: 0, lastElement : 0)}. This
> will cause memory pressure so your step 1 is important.
>
> This is just an optimization, right?
>
> 4- Window these outputs into the Global Window with a Stateful DoFn
>
> Because you finally have to do the stateful ParDo in the Global window, you
> will end up with the same problem - the first three steps just might give
> you some extra time. But if you have enough data (long enough history, or
> very frequent changes, or both), then you will run into the same issues as
> without the optimization here. The BagState simply would not be able to
> hold all the data in the batch case.
>
> Jan
>
> 5 - Add elements to a BagState in the stateful DoFn
> 6 - In the Global Window set an EventTimer to fire at time boundaries that
> match the time window that you need. Note Timers do not have a read
> function for the time that they are set. (Here is one way to set metadata
> to emulate a read function
> <https://stackoverflow.com/questions/55912522/setting-a-timer-to-the-minimum-timestamp-seen/55912542#55912542>)
> Again this can cause memory pressure.
> 7 - At each OnTimer,
> 7a-  read and sort the elements in the BagState,
> 7b - True up the state changes with the cross-window state changes from
> the list.
> 7c - Store the last accumulator into a different State
>
> Sorry that was off the top of my head so could be missing things. For
> example LateData would need to be dealt with outside of this flow...
>
> Cheers
> Reza
>
> On Tue, 21 May 2019 at 07:00, Kenneth Knowles  wrote:
>
>> Thanks for the nice small example of a calculation that depends on order.
>> You are right that many state machines have this property. I agree w/ you
>> and Luke that it is convenient for batch processing to sort by event
>> timestamp before running a stateful ParDo. In streaming you could also
>> implement "sort by event timestamp" by buffering until you know all earlier
>> data will be dropped - a slack buffer up to allowed lateness.
>>
>> I do not think that it is OK to sort in batch and not in streaming. Many
>> state machines diverge very rapidly when things are out of order. So each
>> runner, if it sees the "@OrderByTimestamp" annotation (or whatever), needs
>> to deliver sorted data (by some mix of buffering and dropping), or to
>> reject the pipeline as unsupported.
>>
>> And I also want to say that this is not the default case - many uses of
>> state & timers in ParDo yield different results at the element level, but
>> the results are equivalent in the big picture. Such as the example of
>> "assign a unique sequence number to each element" or "group into batches"
>> it doesn't matter exactly what the result is, only that it meets the spec.
>> And other cases like user funnels are monotonic enough that you also don't
>> actually need sorting.
>>
>> Kenn
>>
>> On Mon, May 20, 2019 at 2:59 PM Jan Lukavský  wrote:
>>
>>> Yes, the problem will probably arise mostly when you have poorly
>>> distributed keys (or too few keys). I'm really not sure if a pure GBK with
>>> a trigger can solve this - it might help to have a data-driven trigger. There
>>> would still be some doubts, though. The main question is still here -
>>> people say, that sorting by timestamp before stateful ParDo would be
>>> prohibitively slow, but I don't really see why - the sorting is very
>>> probably already there. And if not (hash grouping instead of sorted
>>> grouping), then the sorting would affect only user defined StatefulParDos.
>>>
>>

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-20 Thread Reza Rokni
Hi,

If I have understood the use case correctly, your output is an ordered
counter of state changes.

One approach which might be worth exploring is outlined below, haven't had
a chance to test it so could be missing pieces or be plain old wrong (will
try and come up with a test example later on to try it out).

1 - Window into a small enough Duration such that the number of elements in
a window per key can be read into memory structure for sorting.
2 - GBK
3 - In a DoFn do the ordering and output Timestamped elements that
contain the state changes for just that window and the value of the last
element  {timestamp-00:00:00: (one: 1, zero: 0, lastElement : 0)}. This
will cause memory pressure so your step 1 is important.
4- Window these outputs into the Global Window with a Stateful DoFn
5 - Add elements to a BagState in the stateful DoFn
6 - In the Global Window set an EventTimer to fire at time boundaries that
match the time window that you need. Note Timers do not have a read
function for the time that they are set. (Here is one way to set metadata
to emulate a read function
<https://stackoverflow.com/questions/55912522/setting-a-timer-to-the-minimum-timestamp-seen/55912542#55912542>)
Again this can cause memory pressure.
7 - At each OnTimer,
7a-  read and sort the elements in the BagState,
7b - True up the state changes with the cross-window state changes from the
list.
7c - Store the last accumulator into a different State
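Steps 1-3 above can be sketched in plain Python, with dicts standing in for the runner's GBK; the `summarize` helper and the `(key, timestamp, value)` tuple layout are illustrative assumptions, not Beam APIs:

```python
from collections import defaultdict

WINDOW_SECS = 3600  # step 1: a tumbling window small enough to sort in memory


def window_start(ts):
    """Map a timestamp to the start of its tumbling window."""
    return ts - (ts % WINDOW_SECS)


def summarize(events):
    """Steps 1-3 for (key, timestamp, value) events: window, group by
    (key, window), sort each group by timestamp, and emit the counts of
    0->1 / 1->0 transitions plus the last value seen in the window.
    Transitions across window boundaries are "trued up" later (step 7b)."""
    groups = defaultdict(list)  # step 2: the "GBK" by (key, window)
    for key, ts, value in events:
        groups[(key, window_start(ts))].append((ts, value))

    out = {}
    for (key, win), elems in groups.items():
        elems.sort()  # step 3: in-memory sort, hence the memory pressure
        ones = zeros = 0
        prev = None
        for _, v in elems:
            if prev is not None and v != prev:
                if v == 1:
                    ones += 1
                else:
                    zeros += 1
            prev = v
        out[(key, win)] = {"one": ones, "zero": zeros, "lastElement": prev}
    return out
```

With hourly windows, `summarize([("k", 10, 0), ("k", 20, 1), ("k", 30, 0)])` produces `{("k", 0): {"one": 1, "zero": 1, "lastElement": 0}}` - the same shape as the `{timestamp-00:00:00: (one: 1, zero: 0, lastElement: 0)}` example above.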

Sorry that was off the top of my head so could be missing things. For
example LateData would need to be dealt with outside of this flow...
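Step 6's workaround - timers have no read function, so keep the deadline in a parallel piece of state - can be sketched like this (plain Python; the class name is made up, and the `state`/`timers` dicts stand in for Beam's ValueState and Timer):

```python
class TimerWithReadableDeadline:
    """Emulate a readable timer: Beam timers cannot be read back, so
    store the deadline in a parallel 'state' cell the DoFn can inspect."""

    def __init__(self):
        self.state = {}   # per-key stored deadline (the "metadata" state)
        self.timers = {}  # per-key pending timer, normally opaque

    def set_timer(self, key, ts):
        # Only move the timer earlier (minimum timestamp seen), as in
        # the linked StackOverflow answer.
        current = self.state.get(key)
        if current is None or ts < current:
            self.state[key] = ts   # readable copy of the deadline
            self.timers[key] = ts  # the actual (unreadable) timer

    def deadline(self, key):
        """What a hypothetical Timer.read() would return."""
        return self.state.get(key)
```

Setting the timer for 100, then 50, then 200 leaves the readable deadline at 50, the minimum timestamp seen.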

Cheers
Reza

On Tue, 21 May 2019 at 07:00, Kenneth Knowles  wrote:

> Thanks for the nice small example of a calculation that depends on order.
> You are right that many state machines have this property. I agree w/ you
> and Luke that it is convenient for batch processing to sort by event
> timestamp before running a stateful ParDo. In streaming you could also
> implement "sort by event timestamp" by buffering until you know all earlier
> data will be dropped - a slack buffer up to allowed lateness.
>
> I do not think that it is OK to sort in batch and not in streaming. Many
> state machines diverge very rapidly when things are out of order. So each
> runner, if it sees the "@OrderByTimestamp" annotation (or whatever), needs
> to deliver sorted data (by some mix of buffering and dropping), or to
> reject the pipeline as unsupported.
>
> And I also want to say that this is not the default case - many uses of
> state & timers in ParDo yield different results at the element level, but
> the results are equivalent in the big picture. Such as the example of
> "assign a unique sequence number to each element" or "group into batches"
> it doesn't matter exactly what the result is, only that it meets the spec.
> And other cases like user funnels are monotonic enough that you also don't
> actually need sorting.
>
> Kenn
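Kenn's point that many state machines diverge rapidly under reordering is easy to see with the thread's running example, a counter of state changes - the same multiset of values gives different answers in different orders (plain-Python illustration, no Beam APIs):

```python
def count_changes(values):
    """Order-sensitive state machine: count adjacent value changes."""
    changes = 0
    for prev, cur in zip(values, values[1:]):
        if cur != prev:
            changes += 1
    return changes


# Same elements, different order, different answer - which is why a
# runner must either deliver sorted data or reject the pipeline:
assert count_changes([0, 1, 1, 0]) == 2
assert count_changes([1, 0, 1, 0]) == 3
```

By contrast, a spec like "group into batches" accepts any of the reorderings, so no sorting is needed there.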
>
> On Mon, May 20, 2019 at 2:59 PM Jan Lukavský  wrote:
>
>> Yes, the problem will probably arise mostly when you have poorly
>> distributed keys (or too few keys). I'm really not sure if a pure GBK with
>> a trigger can solve this - it might help to have a data-driven trigger. There
>> would still be some doubts, though. The main question is still here -
>> people say, that sorting by timestamp before stateful ParDo would be
>> prohibitively slow, but I don't really see why - the sorting is very
>> probably already there. And if not (hash grouping instead of sorted
>> grouping), then the sorting would affect only user defined StatefulParDos.
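Jan's argument - that a batch runner could cheaply deliver each key's values sorted by timestamp before a stateful ParDo, since grouping already sorts - can be illustrated with a toy batch "runner". All names here are hypothetical helpers, not Beam APIs:

```python
from itertools import groupby


def run_stateful_batch(elements, process, sort_by_timestamp=True):
    """Toy batch 'runner' for a stateful ParDo over (key, ts, value)
    tuples: shuffle by key, then (optionally, as the proposed annotation
    would request) sort each key's values by timestamp before calling
    process(state, value)."""
    out = {}
    elements = sorted(elements, key=lambda e: e[0])  # group by key
    for key, group in groupby(elements, key=lambda e: e[0]):
        pairs = [(ts, value) for _, ts, value in group]
        if sort_by_timestamp:
            pairs.sort()  # secondary sort; could piggyback on the shuffle's sort
        state = None
        for _, value in pairs:
            state = process(state, value)
        out[key] = state
    return out


def count_state_changes(state, value):
    """The thread's running example: track (last_value, change_count)."""
    if state is None:
        return (value, 0)
    last, n = state
    return (value, n + (1 if value != last else 0))
```

Running the same out-of-order input with and without the secondary sort yields different final states, which is exactly why only annotated (order-sensitive) StatefulParDos would need to pay for the sort.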
>>
>> This would suggest that the best way out of this would be really to add
>> annotation, so that the author of the pipeline can decide.
>>
>> If that would be acceptable I think I can try to prepare some basic
>> functionality, but I'm not sure, if I would be able to cover all runners /
>> sdks.
>> On 5/20/19 11:36 PM, Lukasz Cwik wrote:
>>
>> It is a read-all per key and window, not just a read-all (this still won't
>> scale with hot keys in the global window). The GBK preceding the
>> StatefulParDo will guarantee that you are processing all the values for a
>> specific key and window at any given time. Is there a specific
>> window/trigger that is missing that you feel would remove the need for you
>> to use StatefulParDo?
>>
>> On Mon, May 20, 2019 at 12:54 PM Jan Lukavský  wrote:
>>
>>> Hi Lukasz,
>>>
>>> > Today, if you must have a strict order, you must guarantee that your
>>> StatefulParDo implements the necessary "buffering & sorting" into state.
>>>
>>> Yes, no problem with that. But this whole discussion started, because
>>> *this doesn't work on batch*. You simply cannot first read everything from
>>> distributed storage and then buffer it all into memory, just to read it
>>> again, but sorted. That will not work. And even if it would, it would be a
>>> terrible waste of resources.
>>>
>>> Jan
>>> On 5/20/19 8:39 PM, Lukasz Cwik wrote:
>>>
>>>
>>>
>>> On Mon, May 20, 2019 at 

Re: [ANNOUNCE] New PMC Member: Pablo Estrada

2019-05-14 Thread Reza Rokni
Awesome news :-)

*From: *Kenneth Knowles 
*Date: *Wed, 15 May 2019, 11:25
*To: *dev

Hi all,
>
> Please join me and the rest of the Beam PMC in welcoming Pablo Estrada to
> join the PMC.
>
> Pablo first picked up BEAM-722 in October of 2016 and has been a steady
> part of the Beam community since then. In addition to technical work on
> Beam Python & Java & runners, I would highlight how Pablo grows Beam's
> community by helping users, working on GSoC, giving talks at Beam Summits
> and other OSS conferences including Flink Forward, and holding training
> workshops. I cannot do justice to Pablo's contributions in a single
> paragraph.
>
> Thanks Pablo, for being a part of Beam.
>
> Kenn
>


Re: SqlTransform Metadata

2019-05-14 Thread Reza Rokni
Hi,

One use case would be when dealing with the windowing functions for example:

SELECT f_int, COUNT(*) , TUMBLE_START(f_timestamp, INTERVAL '1' HOUR)
tumble_start
  FROM PCOLLECTION
  GROUP BY
f_int,
TUMBLE(f_timestamp, INTERVAL '1' HOUR)

For an element which is using metadata to inform the EventTime of the
element, rather than data within the element itself, I would need to create
a new schema which adds the timestamp as a field. I think other examples
which may be interesting are getting the value of a row with the max/min
timestamp. None of this would be difficult, but it does feel a little on the
verbose side and also makes the pipeline a little harder to read.
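The reification being described - a step before SqlTransform that copies the element's metadata timestamp into a schema field so SQL functions like TUMBLE can reference it - looks roughly like this plain-dict sketch (`reify_timestamp` and the field name are illustrative, not Beam APIs; in Beam this would be a ParDo producing Rows with an extended schema):

```python
def reify_timestamp(element, timestamp, field="event_timestamp"):
    """Copy an element's metadata timestamp into a new schema field,
    leaving the original fields untouched."""
    row = dict(element)  # don't mutate the input element
    row[field] = timestamp
    return row
```

Every query that needs the timestamp has to be preceded by such a step, which is the verbosity being complained about.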

Cheers
Reza





*From: *Kenneth Knowles 
*Date: *Wed, 15 May 2019 at 01:15
*To: *dev

We have support for nested rows so this should be easy. The .withMetadata
> would reify the struct, moving from Row to WindowedValue<Row> if I
> understand it...
>
> SqlTransform.query("SELECT field1 from PCOLLECTION"):
>
> Schema = {
>   field1: type1,
>   field2: type2
> }
>
> SqlTransform.query(...)
>
> SqlTransform.withMetadata().query("SELECT event_timestamp, value.field1
> FROM PCOLLECTION")
>
> Derived schema = {
>   event_timestamp: TIMESTAMP,
>   pane_info: { ... }
>   value: {
> field1: type1,
> field2: type2,
> ...
>   }
> }
>
> SqlTransform would expand into a different composite, and it would be a
> straightforward ParDo to adjust the data, possibly automatic via the new
> schema conversions.
>
> Embedding the window would be a bit wonky, something like { end_of_window:
> TIMESTAMP, encoded_window: bytes } which would be expensive due to
> encoding. But timestamp and pane info not so bad.
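The derived schema above can be mimicked with nested dicts standing in for nested Rows (`with_metadata` is an illustrative stand-in for the proposed expansion, not an existing API):

```python
def with_metadata(element, event_timestamp, pane_info):
    """Wrap the original row under 'value' next to its metadata, so a
    query can select event_timestamp or value.field1."""
    return {
        "event_timestamp": event_timestamp,
        "pane_info": pane_info,
        "value": element,
    }
```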
>
> Kenn
>
> *From: *Anton Kedin 
> *Date: *Tue, May 14, 2019 at 9:17 AM
> *To: * 
>
> Reza, can you share more thoughts on how you think this can work
>> end-to-end?
>>
>> Currently the approach is that populating the rows with the data happens
>> before the SqlTransform, and within the query you can only use the
>> things that are already in the rows or in the catalog/schema (or built-in
>> things). In general case populating the rows with any data can be solved
>> via a ParDo before SqlTransform. Do you think this approach lacks something
>> or maybe too verbose?
>>
>> My thoughts on this, lacking more info or concrete examples: in order to
>> access a timestamp value from within a query there has to be a syntax for
>> it. Field access expressions or function calls are the only things that
>> come to mind among existing syntax features that would allow that. Making
>> timestamp a field of the data row makes more sense to me here because in
>> Beam it is already a part of the element. It's not a result of a function
>> call and it's already easily accessible, doesn't make sense to build extra
>> functions here. One of the problems with both approaches however is the
>> potential conflicts with the existing schema of the data elements (or the
>> schema/catalog of the data source in general). E.g. if we add a magical
>> "event_timestamp" column or "event_timestamp()" function there may
>> potentially already exist a field or a function in the schema with this
>> name. This can be solved in a couple of ways, but we will probably want to
>> provide a configuration mechanism to assign a different field/function
>> names in case of conflicts.
>>
>> Given that, it may make sense to allow users to attach the whole pane
>> info or some subset of it to the row (e.g. only the timestamp), and make
>> that configurable. However I am not sure whether exposing something like
>> pane info is enough and will cover a lot of useful cases. Plus adding
>> methods like `attachTimestamp("fieldname")` or
>> `attachWindowInfo("fieldname")` might open a portal to an ever-increasing
>> collection of these `attachX()`, `attachY()` that can make the API less
>> usable. If on the other hand we would make it more generic then it will
>> probably have to look a lot like a ParDo or MapElements.via() anyway. And
>> at that point the question would be whether it makes sense to build
>> something extra that probably looks and functions like an existing feature.
>>
>> Regards,
>> Anton
>>
>>
>>
>> *From: *Andrew Pilloud 
>> *Date: *Tue, May 14, 2019 at 7:29 AM
>> *To: *dev
>>
>> Hi Reza,
>>>
>>> Where will this metadata be coming from? Beam SQL is tightly coupled
>>> with the schema of the PCollection, so adding fields not in the data wou

Re: Contributing Beam Kata (Java & Python)

2019-05-14 Thread Reza Rokni
+1 :-)

*From: *Lukasz Cwik 
*Date: *Wed, 15 May 2019 at 04:29
*To: *dev
*Cc: *Lars Francke

+1
>
> *From: *Pablo Estrada 
> *Date: *Tue, May 14, 2019 at 1:27 PM
> *To: *dev
> *Cc: *Lars Francke
>
> +1 on merging.
>>
>> *From: *Reuven Lax 
>> *Date: *Tue, May 14, 2019 at 1:23 PM
>> *To: *dev
>> *Cc: *Lars Francke
>>
>>> I've been playing around with this the past day or two, and it's great!
>>> I'm inclined to merge this PR (if nobody objects) so that others in the
>>> community can contribute more training katas.
>>>
>>> Reuven
>>>
>>> *From: *Ismaël Mejía 
>>> *Date: *Tue, Apr 23, 2019 at 6:43 AM
>>> *To: *Lars Francke
>>> *Cc: * 
>>>
>>> Thanks for answering Lars,

 The 'interesting' part is that the tutorial has a full IDE-integrated
 experience based on the JetBrains Edu platform [1]. So it may be
 interesting to see if it could make sense to have projects like this
 in the new trainings incubator project, or if they become too
 platform-constrained.

 This contribution is valuable for Beam but the community may decide
 that it makes sense for it to live at some moment at the trainings
 project. I suppose also Henry could be interested in taking a look at
 this [2].

 [1] https://www.jetbrains.com/education/
 [2] https://incubator.apache.org/clutch/training.html

 On Tue, Apr 23, 2019 at 3:00 PM Lars Francke 
 wrote:
 >
 > Thanks Ismaël.
 >
 > I must admit I'm a tad confused. What has JetBrains got to do with
 this?
 > This looks pretty cool and specific to Beam though, or is this more
 generic?
 > But yeah something along those lines could be interesting for
 hands-on type things in training.
 >
 > On Fri, Apr 19, 2019 at 12:10 PM Ismaël Mejía 
 wrote:
 >>
 >> +lars.fran...@gmail.com who is in the Apache training project and
 may
 >> be interested in this one or at least the JetBrains like approach.
 >>
 >> On Fri, Apr 19, 2019 at 12:01 PM Ismaël Mejía 
 wrote:
 >> >
 >> > This looks great, nice for bringing this to the project Henry!
 >> >
 >> > On Fri, Apr 19, 2019 at 10:53 AM hsuryawira...@google.com
 >> >  wrote:
 >> > >
 >> > > Thanks Altay.
 >> > > I'll create it under "learning/" first as this is not exactly
 example.
 >> > > Please do let me know if it's not the right place.
 >> > >
 >> > > On 2019/04/18 22:49:47, Ahmet Altay  wrote:
 >> > > > This looks great.
 >> > > >
 >> > > > +David Cavazos  was working on
 interactive colab based
 >> > > > examples (https://github.com/apache/beam/pull/7679) perhaps
 we can have a
 >> > > > shared place for these two similar things.
 >> > > >
 >> > >

>>>

-- 

This email may be confidential and privileged. If you received this
communication by mistake, please don't forward it to anyone else, please
erase all copies and attachments, and please let me know that it has gone
to the wrong person.

The above terms reflect a potential business arrangement, are provided
solely as a basis for further discussion, and are not intended to be and do
not constitute a legally binding obligation. No legally binding obligations
will be created, implied, or inferred until an agreement in final form is
executed in writing by all parties involved.


SqlTransform Metadata

2019-05-14 Thread Reza Rokni
Hi,

What are folks thoughts about adding something like
SqlTransform.withMetadata().query(...)to enable users to be able to access
things like Timestamp information from within the query without having to
refiy the information into the element itself?

Cheers
Reza





  1   2   >