Note that even if you use GroupByKey and a 1 second window, it could be
that key K at time T1 and T2 are scheduled to be processed in parallel
which means that you will still need locking.

Apache Beam has no transform which allows you to partition the data how you
want without using synchronization/locking/... unless your underlying
storage engine allows you to pass in user specified version numbers which
then you could use the windowing information to produce larger and larger
version numbers so the storage engine would know which write it should keep
and which write it should discard.

Alternatively, if you know which runner you want to use, it may be that
intrinsically via some execution properties of the runner you ca get what
you need but you'll have a pipeline which isn't following strict Apache
Beam semantics and if the runner was to change, it may break you.

Finally, if none of that works out, you'll want to use a stream processing
engine that allows you to specifically say that any key range should only
ever be processed on a single machine at a time. This can have lots of its
own problems if you hit a hot key since one machine will be swamped
processing while the others are relatively idle.


On Fri, Jul 6, 2018 at 8:13 AM Jean-Baptiste Onofré <[email protected]> wrote:

> Hi Niels,
>
> as you have an Unbounded PCollection, you need a Window to GroupByKey
> and then "forward" the data.
>
> Another option would be to use a DoFn working element per element and
> eventually batching then. It's what most of the IOs are doing for the
> Write part.
>
> Regards
> JB
>
> On 06/07/2018 17:01, Niels Basjes wrote:
> > Hi,
> >
> > I have an unbounded stream of change events each of which has the id of
> > the entity that is changed.
> > To avoid the need for locking in the persistence layer that is needed in
> > part of my processing I want to route all events based on this entity id.
> > That way I know for sure that all events around a single entity go
> > through the same instance of my processing sequentially, hence no need
> > for locking or other synchronization regarding this persistence.
> >
> > At this point my best guess is that I need to use the GroupByKey but
> > that seems to need a Window.
> > I think I don't want a window because the stream is unbounded and I want
> > the lowest possible latency (i.e. a Window of 1 second would be ok for
> > this usecase).
> > Also I want to be 100% sure that all events for a specific id go to only
> > a single instance because I do not want any race conditions.
> >
> > My simple question is: What does that look like in the Beam Java API?
> >
> > --
> > Best regards / Met vriendelijke groeten,
> >
> > Niels Basjes
>
> --
> Jean-Baptiste Onofré
> [email protected]
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>

Reply via email to