Re: Regarding Broadcast of datasets in streaming context

Gyula Fóra Thu, 28 Apr 2016 07:24:47 -0700

Hi Biplob,

I have implemented a similar algorithm as Aljoscha mentioned.

First things to clarify are the following:
There is currently no abstraction for keeping objects (in you case
centroids) in a centralized way that can be updated/read by all operators.
This would probably be very costly and is actually not necessary in your
case.

Broadcast a stream in contrast with other partitioning methods mean that
the events will be replicated to all downstream operators. This not a
magical operator that will make state available among parallel instances.

Now let me explain what I think you want from Flink and how to do it :)

You have input data stream and a set of centroids to be updated based on
the incoming records. As you want to do this in parallel you have an
operator (let's say a flatmap) that keeps the centroids locally and updates
it on it's inputs. Now you have a set of independently updated centroids,
so you want to merge them and update the centroids in each flatmap.

Let's see how to do this. Given that you have your centroids locally,
updating them is super easy, so I will not talk about that. The problematic
part is periodically merging end "broadcasting" the centroids so all the
flatmaps eventually see the same (they don't have to always be the same for
clustering probably). There is no operator for sending state (centroids)
between subtasks so you have to be clever here. We can actually use cyclic
streams to solve this problem by sending the centroids as simple events to
a CoFlatMap:

DataStream<Point> input = ...
ConnectedIterativeStreams<Point, Centroids> inputsAndCentroids =
input.iterate().withFeedbackType(Centroids.class)
DataStream<Centroids> updatedCentroids =
inputsAndCentroids.flatMap(MyCoFlatmap)
inputsAndCentroids.closeWith(updatedCentroids.broadcast())

MyCoFlatmap would be a CoFlatMapFunction which on 1 input receive events
and update its local centroids (and periodically output the centroids) and
on the other input would send centroids of other flatmaps and would merge
them to the local.

This might be a lot to take in at first, so you might want to read up on
streaming iterations and connected streams before you start.

Let me know if this makes sense.

Cheers,
Gyula

Biplob Biswas <revolutioni...@gmail.com> ezt írta (időpont: 2016. ápr. 28.,
Cs, 14:41):

> That would really be great, any example would help me proceed with my work.
> Thanks a lot.
>
>
> Aljoscha Krettek wrote
> > Hi Biplob,
> > one of our developers had a stream clustering example a while back. It
> was
> > using a broadcast feedback edge with a co-operator to update the
> > centroids.
> > I'll directly include him in the email so that he will notice and can
> send
> > you the example.
> >
> > Cheers,
> > Aljoscha
> >
> > On Thu, 28 Apr 2016 at 13:57 Biplob Biswas &lt;
>
> > revolutionisme@
>
> > &gt; wrote:
> >
> >> I am pretty new to flink systems, thus can anyone atleast give me an
> >> example
> >> of how datastream.broadcast() method works? From the documentation i get
> >> the
> >> following:
> >>
> >> broadcast()
> >> Sets the partitioning of the DataStream so that the output elements are
> >> broadcasted to every parallel instance of the next operation.
> >>
> >> If the output elements are broadcasted, then how are they retrieved? Or
> >> maybe I am looking at this method in a completely wrong way?
> >>
> >> Thanks
> >> Biplob Biswas
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >>
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Regarding-Broadcast-of-datasets-in-streaming-context-tp6456p6543.html
> >> Sent from the Apache Flink User Mailing List archive. mailing list
> >> archive
> >> at Nabble.com.
> >>
>
>
>
>
>
> --
> View this message in context:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Regarding-Broadcast-of-datasets-in-streaming-context-tp6456p6548.html
> Sent from the Apache Flink User Mailing List archive. mailing list archive
> at Nabble.com.
>

Re: Regarding Broadcast of datasets in streaming context

Reply via email to