+dev <d...@beam.apache.org> and some people I talk about joins with

Interesting! It is a lot to take in and fully grok the code, so calling in
reinforcements...

Generally, I think there's agreement that for a lot of real use cases, you
have to roll your own join using the lower-level Beam primitives. So I
think it would be great to get some of these other approaches to joins into
Beam, perhaps as an extension of the Java SDK or even in the core (since
schema joins are in the core). In particular:

 - "join in fixed window with repeater" sounds similar (but not identical)
to work by Mikhail
 - "join in global window with cache" sounds similar (but not identical) to
work and discussions w/ Reza and Tyson

I want to be clear that I am *not* saying there's any duplication. I'm
guessing these all fit into a collection of different ways to accomplish
joins, and if everything comes to fruition we will have the great
opportunity to document how a user should choose between them.

Kenn

On Fri, May 1, 2020 at 7:56 AM Marcin Kuthan <marcin.kut...@gmail.com>
wrote:

> Hi,
>
> This is my first post here, but I have been reading the group for a
> while, so thank you for sharing the knowledge!
>
> I've been using Beam/Scio on Dataflow for about a year, mostly for stream
> processing from unbounded sources like PubSub. During my daily work I found
> that the built-in windowing is very generic and provides rich watermark/late
> event semantics, but there are a few very annoying limitations, e.g.:
> - both sides of the join must be defined within compatible windows
> - for fixed windows, elements close to window boundaries (but in different
> windows) won't be joined
> - for sliding windows there is a huge overhead if the duration is much
> longer than the offset
>
> I would like to ask you to review a few "join/windowing patterns" with
> custom stateful ParDos, not as generic as the Beam built-ins but perhaps
> better suited to more specific needs. I published the code with tests; feel
> free to comment via GitHub issues or on the mailing list. Event-time
> processing with watermarks is so demanding that I'm almost sure I have
> overlooked many important corner cases.
> https://github.com/mkuthan/beam-examples
>
> If you think the examples are useful, I'll be glad to write a blog post
> with more details :)
>
> Marcin
>
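
The fixed-window and sliding-window limitations listed in the quoted message can be illustrated with a small sketch. This is plain Python, not the Beam API: the helper names `fixed_window` and `sliding_windows` are hypothetical, timestamps are assumed to be integer seconds, and windows are modelled as half-open `[start, end)` intervals aligned to zero.

```python
def fixed_window(ts, size):
    """Return the [start, end) fixed window of length `size` containing `ts`."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts, size, period):
    """Return every [start, end) sliding window containing `ts`.

    A new window of length `size` is assumed to start every `period` seconds.
    """
    windows = []
    start = ts - ts % period  # latest window start that is <= ts
    while start > ts - size:  # window still covers ts
        windows.append((start, start + size))
        start -= period
    return windows


# Two events only 2 seconds apart fall into different 60-second fixed windows,
# so a per-window join would never see them together.
print(fixed_window(59, 60))  # (0, 60)
print(fixed_window(61, 60))  # (60, 120)

# With sliding windows each element is assigned to size / period windows,
# which is the overhead when the duration is much longer than the offset.
print(len(sliding_windows(125, 60, 10)))   # 6
print(len(sliding_windows(125, 3600, 1)))  # 3600
```

Under these assumptions, a one-hour window sliding every second places each element in 3600 windows, which is why a hand-written stateful join can be attractive for such cases.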
