Reshuffle for the purpose of ensuring stable inputs is deprecated, but this
seems a valid(ish) usecase.

Currently, all runners implement GroupByKey by sending everything with the
same key to the same machine, however this is not guaranteed by the Beam
model, and changing it has been tossed around (e.g. when using fixed
windows, one could send different key-window pairs to different machines).
Even when this is the case, there's no guarantee that there won't be backup
workers processing the same key, or even if the runner doesn't do backups,
zombie workers (e.g. where we thought the worker died, and allocated its
work elsewhere, but it turns out the worker is till churning away possibly
causing side effects). Such is the nature of a distributed system.

If you go the GBK route, rather than windowing by 1 second, you could us
an AfterPane.elementCountAtLeast(1) trigger [1] (even in the global window)
for absolute minimal latency.
https://beam.apache.org/documentation/programming-guide/#data-driven-triggers
. This is essentially what Reshuffle does.

- Robert


On Fri, Jul 6, 2018 at 11:11 PM Jean-Baptiste Onofré <[email protected]>
wrote:

> Hi Raghu,
>
> AFAIR, Reshuffle is considered as deprecated. Maybe it would be better
> to avoid to use it no ?
>
> Regards
> JB
>
> On 06/07/2018 18:28, Raghu Angadi wrote:
> > I would use Reshuffle()[1] with entity id as the key. It internally does
> > a GroupByKey and sets up windowing such that it does not buffer anything.
> >
> > [1]
> > :
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Reshuffle.java#L51
>
> >
> > On Fri, Jul 6, 2018 at 8:01 AM Niels Basjes <[email protected]
> > <mailto:[email protected]>> wrote:
> >
> >     Hi,
> >
> >     I have an unbounded stream of change events each of which has the id
> >     of the entity that is changed.
> >     To avoid the need for locking in the persistence layer that is
> >     needed in part of my processing I want to route all events based on
> >     this entity id.
> >     That way I know for sure that all events around a single entity go
> >     through the same instance of my processing sequentially, hence no
> >     need for locking or other synchronization regarding this persistence.
> >
> >     At this point my best guess is that I need to use the GroupByKey but
> >     that seems to need a Window.
> >     I think I don't want a window because the stream is unbounded and I
> >     want the lowest possible latency (i.e. a Window of 1 second would be
> >     ok for this usecase).
> >     Also I want to be 100% sure that all events for a specific id go to
> >     only a single instance because I do not want any race conditions.
> >
> >     My simple question is: What does that look like in the Beam Java API?
> >
> >     --
> >     Best regards / Met vriendelijke groeten,
> >
> >     Niels Basjes
> >
>
> --
> Jean-Baptiste Onofré
> [email protected]
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>

Reply via email to